Fake It 'Til You Make It

•Download as PPTX, PDF•

0 likes•109 views

Presented at Monitorama 2016 in Portland, OR. This presentation is about using statistical information to simulate high volume data traffic during product development.

Technology

Fake It ’Til You Make It
A case study in generating more data than your systems
John Stanford -- @jxstanford

• Structure
• 1 controller
• 6 compute nodes
• Usage
• Training classes
• R & D
• Pipeline
• 7 Heka log shippers
• 1 Heka log collector
• 1 Elasticsearch node
shipper
shipper
shipper
shipper
shipper
shipper
shipper
collector Elasticsearch
The Environment

shipper
shipper
shipper
shipper
shipper
shipper
shipper
collector Elasticsearch
shipper
shipper
shipper
shipper
shipper
shipper
shipper
collector Elasticsearch
Elasticsearch
Elasticsearch
Elasticsearch
Elasticsearch
Elasticsearch
Elasticsearch
collector
collector
collector
collector
collector
collector
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
¯_(ツ)_
/¯

• 21 – 31K per hour
• An obvious pattern
• A dead spot
• A couple spikes
7 Days

By Server
• 96% from one node
• 1% per remaining node
• A couple not reporting

By Component
• 60%
• 25%
• 15%
• 14%
• 1%

• Roughly 25K messages/hour
• Controllers are 100x noisier than compute nodes
• Swift generates 60% of traffic
• 99.9% of the time, there were less than 65 messages/sec
• There are some traffic spikes of ~1600 messages/sec
• 95% of messages were less than 240 bytes
• Largest message was 782 bytes

Model parameters
• Flow rate
• Hosts
• Components
• Size
• Content
• ?

The Scientist and Engineer's Guide to Digital Signal Processing - Steven W. Smith, Ph.D.

shipper
shipper
shipper
shipper
shipper
shipper
shipper
collector Elasticsearch

[flood_1ms]
ip_address = “127.0.0.1:9997”
sender = “tcp”
pprof_file = “”
encoder = “protobuf”
num_messages = 0
message_interval = “1ms”
max_message_size = 2800
variable_message_size = true
late_bind_timestamp = true
• Add flood process
• Monitor everything
• Repeat until it breaks
The plan

heka-flood
Seems like a good idea
• The good:
• Control over message rate
• The not so good:
• Not enough control over message content
• Timestamps assigned at initialization (pull request forthcoming)
• Couldn’t get variable message sizing to work
• Here comes the meatloaf…

Tactical
• System sustained 4000 x ~1k msgs/sec
• Collector and/or ES started to pause above that
• No messages were dropped
Strategic
• Load tool was underqualified
• Monitoring tool resolution is important for interpretation
• I should prepare for presentations further in advance
What did we learn?

Intuitions
• 4MB/sec is:
• 17K messages of 240 bytes
• 240 bytes is 95th percentile
• 64 msgs/sec is:
• 99.9th percentile
• 265 controllers
• 1060 compute
• 1:100 ratio
Our First Model

• Find the bottlenecks
• System resources?
• Elasticsearch?
• Collector?
• Load generator?
• Improve the model
• Probability distributions
• Noise
• Real message samples
• Off host load
• Regime changes
• Real world feedback
Next Steps

- Lifeguard is a set of optimizations to the SWIM protocol and memberlist failure detector to make it more robust. It addresses issues seen in the field of healthy nodes being falsely detected as failed. - Lifeguard introduces three components: self-awareness using a node status counter, dogpiling requiring multiple independent suspicions before failure detection, and a buddy system to more quickly allow nodes to refute suspicions. - Experiments show Lifeguard significantly reduces false positives, with modest increases to latency and network load in pathological situations. It provides a tunable way for users to tradeoff failures detection speed and false positives.

Lifting the Blinds: Monitoring Windows Server 2012

Datadog

Openstack meetup lyon_2017-09-28

Xavier Lucas

This document summarizes the key aspects of a public cloud archive storage solution. It offers affordable and unlimited storage using standard transfer protocols. Data is stored using erasure coding for redundancy and fault tolerance. Accessing archived data takes 10 minutes to 12 hours depending on previous access patterns, with faster access for inactive archives. The solution uses middleware to handle sealing and unsealing archives along with tracking access patterns to regulate retrieval times.

Monitoring Docker at Scale - Docker San Francisco Meetup - August 11, 2015

Datadog

Leveraging open source tools to gain insight into OpenStack Swift

Dmitry Sotnikov

Performance monitoring and troubleshooting of cloud based object storage is as much an art as science. Although there are a plethora of open source monitoring tools which gather system metrics, the real challenge is how to utilize them to find the root cause of a problem. In this presentation we present a general, open source based, step-by-step methodology to understand performance bottlenecks in a OpenStack Swift system. Our approach uses standard tools including logstash, collectd, statsd, elasticsearch, kibana and graphite. We also describe an additional simple Swift middleware we developed to help gain further insights. Finally, we demonstrate results obtained from our approach used in an internal deployment of OpenStack Swift.

Resource Scheduling using Apache Mesos in Cloud Native Environments

Sharma Podila

This document discusses using Apache Mesos for scheduling heterogeneous resources in a cloud environment. It describes Mantis, a Mesos framework for reactive stream processing. Mantis provides lightweight jobs, dynamic scaling, and custom SLAs. Fenzo is introduced as Mantis' task scheduler, which uses plugins for constraints, fitness functions, and autoscaling. Mantis allows for stream locality, backpressure handling, and job autoscaling. The document argues that Mesos provides benefits over instance-level scheduling through finer-grained resource allocation and faster task startup times.

Jan 2012 HUG: Storm

Yahoo Developer Network

Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it’s fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language. Storm was open-sourced by Twitter in September of 2011 and has since been adopted by many companies around the world. Storm has a wide range of use cases, from stream processing to continuous computation to distributed RPC. In this talk I'll introduce Storm and show how easy it is to use for realtime computation.

- Fred de Villamil is the director of infrastructure at Synthesio and has been working with Linux/BSD and open source since 1996. - Synthesio uses Elasticsearch to power over 13,000 dashboards, indexing over 75 billion documents and 200TB of data across 5 clusters with 163 servers and 400TB of storage. - They initially had performance issues with cross-cluster queries in MySQL but migrated to Elasticsearch in 2015 and saw significant performance improvements with their "Clipping Revolution" implementation. - Over time they encountered issues at scale including too many shards, slow restarts, and garbage collection problems. They optimized their implementation with changes like rack awareness, G1GC tuning, and field data cache configuration.

Prezo at-mesos con2015-final

Sharma Podila

Realtime Statistics based on Apache Storm and RocketMQ

Xin Wang

Determinism in finance

Peter Lawrey

Stabilising the jenga tower

Gordon Chung

The document discusses the evolution of Ceilometer, an OpenStack project that collects measurements from deployed clouds and persists the data for later retrieval and analysis. It describes how Ceilometer has scaled out its data collection capabilities over time by adding agents, partitioning workloads, and integrating with Gnocchi to provide more efficient time-series storage. The document also provides best practices for Ceilometer deployment and configuration to optimize data collection, storage and querying.

Securing Sharded Networks with Swarm

Fluence Labs

This document proposes a hybrid approach to securely sharding decentralized databases with low redundancy. It involves using a real-time validator layer for fast transactions, a shared fisherman pool for independent verification of transaction histories uploaded to a decentralized storage network like Swarm, and smart contracts on Ethereum to resolve disputes. This approach reduces the risk of shard takeover to 0.00037% while keeping redundancy costs low compared to naive consensus-based or blockchain-only approaches.

Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime

Fred de Villamil

Network Monitoring with Icinga

learjk

Low level java programming

Peter Lawrey

Introduction to Storm

Eugene Dvorkin

Storm is an open-source distributed real-time computation system. It provides a framework for processing unbounded streams of data reliably and fault-tolerantly. Storm allows data to be analyzed in real-time using spouts, bolts, and topologies. It is scalable, fault-tolerant, guarantees processing, and is easy to code. Storm powers many real-time systems at Twitter and is useful for applications like analytics, personalization, and ETL.

Using Simplicity to Make Hard Big Data Problems Easy

nathanmarz

The document proposes a simple approach to solving a complex problem of computing unique visitors over time ranges that involves maintaining normalized and denormalized views of the data. The approach involves: 1) Storing all data in a master dataset and continuously recomputing indexes and views as a function of all the data to maintain normalized and denormalized views. 2) Querying both recent real-time views and historical batch views to retrieve the necessary data for a time range query, combining for high performance and accuracy. 3) Approximating unique counts for recent data by ignoring real-time equivalences to keep the real-time layer simple while still providing good query performance and eventual accuracy.

PHP Backends for Real-Time User Interaction using Apache Storm.

DECK36

Engaging users in real-time is the topic of our times. Whether it’s a game, a shop, or a content-network, the aim remains the same: providing a personalized experience. In this workshop we will look under the hood of Apache Storm and lay a firm foundation on how to use it with PHP. By that, you can leverage your existing codebase and PHP expertise for an entirely new world: real-time analytics and business logic working on message streams. During the course of the workshop, we will introduce Apache Storm and take a look at all of its components. We will then skyrocket the applicability of Storm by showing you how to implement their components with PHP. All exercises will be conducted using an example project, the infamous and most exhilarating lolcat kitten game ever conceived: Plan 9 From Outer Kitten. In order to follow the hands-on excercises, you will need a development VM prepared by us with all relevant system components and our project repositories. To make the workshop experience as smooth as possible for all participants, please bring a prepared computer to the workshop, as there will be no time to deal with installation and setup issues. Please download all prerequisites and install them as described: VM, Plan 9 webapp, Plan 9 storm backend, (Tutorial: https://github.com/DECK36/plan9_workshop_tutorial ).

Real-Time Big Data at In-Memory Speed, Using Storm

Nati Shalom

Storm, a popular framework from Twitter, is used for real-time event processing. The challenge presented is how to manage the state of your real-time data processing at all times. In addition, you need Storm to integrate with your batch processing system (such as Hadoop) in a consistent manner. This session will demonstrate how to integrate Storm with an in-memory database/grid, and explore various strategies for integrating the data grid with Hadoop and Cassandra, seamlessly. By achieving smooth integration with consistent management, you will be able to easily manage all the tiers of you Big Data stack in a consistent and effective way. - See more at: http://nosql2013.dataversity.net/sessionPop.cfm?confid=74&proposalid=5526#sthash.FWIdqRHh.dpuf

Activation limit&flowlimit&maxjobs

Nirmala Devi

The document discusses various configuration parameters for process engines: Max Jobs sets the maximum number of concurrent process instances in memory; Activation Limit loads process instances sequentially into memory one at a time; and Flow Limit sets the maximum number of concurrently running process instances before suspending new starts. The effects of different configuration combinations are explained.

Webinar: Diagnosing Apache Cassandra Problems in Production

DataStax Academy

This document provides guidance on diagnosing problems in Cassandra production systems. It recommends first using OpsCenter to identify issues, then monitoring servers, applications, and logs. Common problems discussed include incorrect timestamps, tombstones slowing queries, not using a snitch, version mismatches, and disk space not being reclaimed. Diagnostic tools like htop, iostat, and nodetool are presented. The document also covers JVM garbage collection profiling to identify issues like early object promotion and long minor GCs slowing the system.

Advanced off heap ipc

Peter Lawrey

This document discusses advanced inter-process communication (IPC) techniques using off-heap memory in Java. It introduces OpenHFT, a company that develops low-latency software, and their open-source projects Chronicle and OpenHFT Collections that provide high-performance IPC and embedded data stores. It then discusses problems with on-heap memory and solutions using off-heap memory mapped files for sharing data across processes at microsecond latency levels and high throughput.

Redis

George Li

This document discusses common problems a platform engineer may see with Elasticcache and provides solutions. It covers issues related to unexpected behavior, performance, cluster stability, and HA/DR. Specific problems addressed include data becoming stale when not following the cache-aside pattern, latency increases from Redis calls in transactions, large key sizes causing spikes, empty cache values when the database has no value, and lack of reconciliation logic. Solutions involve updating empty cache values, using bloom filters, and ensuring availability during cache penetration or stampedes. Distributed locking challenges and sharding without online resharding are also covered, along with metrics to monitor like cache hit rate and the Datadog ElasticCache dashboard.

Slide #1:Introduction to Apache Storm

Md. Shamsur Rahim

This document provides an introduction to Storm, an open source distributed real-time processing system. It discusses the types of data processing in Storm as either batch or real-time. The key components of a Storm cluster are the Nimbus master node, supervisor worker nodes, and ZooKeeper coordination service. A Storm topology defines the computation as a directed acyclic graph of spouts emitting streams and bolts processing the streams.

Learning Stream Processing with Apache Storm

Eugene Dvorkin

Over the last couple years, Apache Storm became a de-facto standard for developing real-time analytics and complex event processing applications. Storm enables to tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data. Storm enables companies to have "Fast Data" alongside with "Big Data". Some use cases where Storm can be used are Fraud Detection, Operation Intelligence, Machine Learning, ETL, Analytics, etc. In this meetup, Eugene Dvorkin, Architect @WebMD and NYC Storm User Group organizer will teach Apache Storm and Stream Processing fundamentals. While this meeting is geared toward new Storm users, experienced users may find something interesting as well. Following topics will be covered: • Why use Apache Storm? • Common use cases • Storm Architecture - components, concepts, topology • Building simple Storm topology with Java and Groovy • Trident and micro-batch processing • Fault tolerance and guaranteed message delivery • Running and monitoring Storm in production • Kafka • Storm at WebMD • Resources

Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...

DataStax

Data is being collected more and more every year. Cloud applications, including IoT, web, and mobile send torrents of bits at our data centers that have to be processed and stored. In addition, users expect an always-on experience, with little room for error. Numerous companies are successfully doing this every day. In this webinar, you will learn about the convergence of complementary technologies: Spark, Mesos, Akka, Cassandra and Kafka (SMACK), how Apache Kafka can help you get your data under control and the critical role Kafka plays in your data pipeline. Webinar recording: https://youtu.be/uwYlwLyv-1s Webinar Q&A will be posted shortly.

Elastic search

Torstein Hansen

Five Lines I Could Not Draw

Librato, Inc.

The premise of Dave's talk: "Good monitoring changes people." Through his evolving experience with monitoring, Dave Josephsen realized he had been “carrying a misapprehension about what monitoring was and who it was for.” His prior experiences with monitoring were much like those of folks he nowadays meets at conferences: monitoring is terrible, alerts are flooding from everything, and the world is probably burning right now. Through observing all teams - ops, data engineering, design - interact at Librato, he realized that the purpose of monitoring isn’t creating alerts but asking questions. There is no “owner” of monitoring, as everyone has the ability to measure things and ask their own questions.

What's hot

Scaling Elasticsearch at Synthesio

Fred de Villamil

Prezo at-mesos con2015-final

Sharma Podila

Realtime Statistics based on Apache Storm and RocketMQ

Xin Wang

Determinism in finance

Peter Lawrey

Stabilising the jenga tower

Gordon Chung

Securing Sharded Networks with Swarm

Fluence Labs

Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime

Fred de Villamil

Network Monitoring with Icinga

learjk

Low level java programming

Peter Lawrey

Introduction to Storm

Eugene Dvorkin

Using Simplicity to Make Hard Big Data Problems Easy

nathanmarz

PHP Backends for Real-Time User Interaction using Apache Storm.

DECK36

Real-Time Big Data at In-Memory Speed, Using Storm

Nati Shalom

Activation limit&flowlimit&maxjobs

Nirmala Devi

Webinar: Diagnosing Apache Cassandra Problems in Production

DataStax Academy

Advanced off heap ipc

Peter Lawrey

Redis

George Li

Slide #1:Introduction to Apache Storm

Md. Shamsur Rahim

Learning Stream Processing with Apache Storm

Eugene Dvorkin

Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...

DataStax

What's hot (20)

Scaling Elasticsearch at Synthesio

Prezo at-mesos con2015-final

Realtime Statistics based on Apache Storm and RocketMQ

Determinism in finance

Stabilising the jenga tower

Securing Sharded Networks with Swarm

Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime

Network Monitoring with Icinga

Low level java programming

Introduction to Storm

Using Simplicity to Make Hard Big Data Problems Easy

PHP Backends for Real-Time User Interaction using Apache Storm.

Real-Time Big Data at In-Memory Speed, Using Storm

Activation limit&flowlimit&maxjobs

Webinar: Diagnosing Apache Cassandra Problems in Production

Advanced off heap ipc

Redis

Slide #1:Introduction to Apache Storm

Learning Stream Processing with Apache Storm

Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...

Viewers also liked

Elastic search

Torstein Hansen

Five Lines I Could Not Draw

Librato, Inc.

Scaling Pinterest's Monitoring

Brian Overstreet

This document summarizes Brian Overstreet's talk on scaling Pinterest's monitoring system over time as the company and traffic grew. It describes how Pinterest started with just Ganglia for system metrics and no application metrics. They introduced Graphite but faced challenges with packet loss and metrics being dropped. They then introduced OpenTSDB which users were happier with due to its querying speed. Pinterest developed an agent-based pipeline using Kafka and Storm to address packet loss issues and allow over 1.5 million points per second to be ingested by OpenTSDB. Key lessons included the need to educate users, control incoming metrics, and ensure the monitoring system scales with engineers rather than just site users.

Monitoring, graphs and visualisations

morekid

Statistics for Engineers

Heinrich Hartmann

1) Heinrich Hartmann presented on statistics and monitoring for engineers. He discussed various methods for API monitoring including external monitoring, log analysis, and measuring latency averages and percentiles. 2) Histograms were presented as another method that involves dividing the latency and time scales into bands and reporting periods to count samples, allowing flexible analysis while enabling aggregation. 3) Takeaways included being wary of line graphs, not aggregating percentiles but instead using histograms, keeping all raw data, and striving for meaningful metrics.

Monitorama: How monitoring can improve the rest of the company

Jeff Weinstein

Monitoring can improve the entire company by sharing data and techniques across teams. By implementing structured logging, automatic metrics collection, and common data visualization tools, monitoring can become the central data platform. This allows all teams like developers, analysts, and executives to access insights that help improve products, prioritize issues, and make data-driven decisions.

Monitorama 2016

Sarah Hagan

Sessionization with Spark streaming

Ramūnas Urbonas

The document discusses sessionization with Spark streaming to analyze user sessions from a constant stream of page visit data. Key points include: - Streaming page visit data presents challenges like joining new visits to ongoing sessions and handling variable data volumes and long user sessions. - The proposed solution uses Spark streaming to join a checkpoint of incomplete sessions with new visit data to calculate session metrics in real-time. - Important aspects are controlling data ingress size and partitioning to optimize performance of operations like joins and using custom formats to handle output to multiple sinks.

Everything obfuscurity taught me about monitoring

Pete Cheslock

The document appears to be a series of tweets from Pete Cheslock about monitoring best practices and lessons learned over his career. Some key points discussed include using Graphite for time-series data collection and storage, leveraging existing tools like StatsD that developers are already using, building services that are consumable by developers to encourage cross-team collaboration, and focusing on solving your own company's problems rather than trying to replicate what large companies do.

Infrastructure as code might be literally impossible part 2

ice799

The document discusses various issues with infrastructure as code including complexities that arise from software licenses, bugs, and inconsistencies across tools and platforms. Specific examples covered include problems with SSL and APT package management on Debian/Ubuntu, Linux networking configuration difficulties, and inconsistencies in Python packaging related to naming conventions for packages containing hyphens, underscores, or periods. Potential causes discussed include legacy code, lack of time for thorough testing and bug fixing, and economic pressures against developing fully working software systems.

2016 metrics-as-culture

Nicole Forsgren

Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...

Adrian Cockcroft

Google Cloud Platform: Prototype ->Production-> Planet scale

Idan Tohami

All of Your Network Monitoring is (probably) Wrong

ice799

Monitoring Challenges - Monitorama 2016 - Monitoringless

Adrian Cockcroft

Viewers also liked (15)

Elastic search

Five Lines I Could Not Draw

Scaling Pinterest's Monitoring

Monitoring, graphs and visualisations

Statistics for Engineers

Monitorama: How monitoring can improve the rest of the company

Monitorama 2016

Sessionization with Spark streaming

Everything obfuscurity taught me about monitoring

Infrastructure as code might be literally impossible part 2

2016 metrics-as-culture

Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...

Google Cloud Platform: Prototype ->Production-> Planet scale

All of Your Network Monitoring is (probably) Wrong

Monitoring Challenges - Monitorama 2016 - Monitoringless

Similar to Fake It 'Til You Make It

Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab

CloudxLab

Big Data with Hadoop & Spark Training: http://bit.ly/2kvXlPd This CloudxLab Introduction to Apache ZooKeeper tutorial helps you to understand ZooKeeper in detail. Below are the topics covered in this tutorial: 1) Data Model 2) Znode Types 3) Persistent Znode 4) Sequential Znode 5) Architecture 6) Election & Majority Demo 7) Why Do We Need Majority? 8) Guarantees - Sequential consistency, Atomicity, Single system image, Durability, Timeliness 9) ZooKeeper APIs 10) Watches & Triggers 11) ACLs - Access Control Lists 12) Usecases 13) When Not to Use ZooKeeper

How does the Cloud Foundry Diego Project Run at Scale?

VMware Tanzu

How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...

Amit Gupta

Big data from the LHC commissioning: practical lessons from big science - Sim...

jaxLondonConference

Presented at JAX London 2013 The Large Hadron Collider experiments manage tens of petabytes of data spread across hundreds of data centres. Managing and processing this volume required significant infrastructure and novel software systems, involving years of R&D and significant commissioning to prepare for the LHC First Data. The evolution of this global computing infrastructure, and the specialisations made by the experiments, have lessons relevant for many commercial "big data" users.

Real time analytics using Hadoop and Elasticsearch

Abhishek Andhavarapu

This document discusses using Hadoop and Elasticsearch for real-time analytics. It provides an overview of Elasticsearch, including how it is document-oriented, schema-free, distributed and fast. It also demonstrates indexing, retrieving, updating and deleting documents from Elasticsearch. The demo portion involves extracting data from a SQL database using Hive, transforming it with Hadoop/Hive, and loading it into Elasticsearch to run queries. Lessons learned focus on concurrency, filtering, field data caching and JVM memory usage.

London devops logging

Tomas Doran

Performance

Christophe Marchal

Алексей Петров "PHP at Scale: Knowing enough to be dangerous!"

Fwdays

PHP at Scale: Knowing enough to be dangerous! by Oleksii Petrov discusses how to scale PHP applications. It covers strategies like caching, queueing, read/write splitting, and sharding. It also discusses using load balancers and choosing the right database. The key is to improve system metrics without dramatically changing the system. Scaling is predefined by your stack and architecture. Performance comes from optimizations everywhere, not just PHP. Being distributed is very challenging.

MSR 2009

swy351

The document proposes using MapReduce as a general framework to support research in mining software repositories (MSR). It describes how MapReduce can provide efficiency, scalability, adaptability and flexibility for common MSR tasks like analyzing large code repositories. A case study of applying MapReduce to the J-REX MSR tool shows significant reductions in running time for large datasets. Minimal programming effort was required and MapReduce could run on various computing environments.

2013 py con awesome big data algorithms

c.titus.brown

This document provides an overview of algorithms for analyzing large datasets, referred to as "big data". It discusses skip lists, HyperLogLog counting, and Bloom filters as examples of probabilistic data structures that can be used for problems involving big data. These algorithms provide approximate answers but are more scalable and memory efficient than exact algorithms. The document also describes applications of these algorithms to analyzing shotgun DNA sequencing data from metagenomics studies.

Flink Forward Berlin 2018: Lasse Nedergaard - "Our successful journey with Fl...

Flink Forward

At Trackunit we have based our telematic IoT processing pipeline on Flink. We started out on version 1.2 and are now on 1.5. In this session I will share the lessons learned going from one giant Flink job to many smalls and some of the problems we have seen operating Flink on AWS EMR cluster, including topics such as: • Why external enrichment can be challenging with Flink Async operator. • Pattern to change external enrichment into streaming join. • Building your own source • Why Flink restart is great but should be avoided as it will terminate your cluster. • Why iteration can cause deadlocking when backpressure occurs. • Kinesis rate exceeded exception • Why throttling Flink source read during catchup is needed. • Why we moved from EMR/Kinesis and into DC/OS and kafka. • And much more.

Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...

Paul Brebner

Apache Kafka, Apache Cassandra and Kubernetes are open source big data technologies enabling applications and business operations to scale massively and rapidly. While Kafka and Cassandra underpins the data layer of the stack providing capability to stream, disseminate, store and retrieve data at very low latency, Kubernetes is a container orchestration technology that helps in automated application deployment and scaling of application clusters. In this presentation, Paul will reveal how he architected a massive scale deployment of a streaming data pipeline with Kafka and Cassandra to cater to an example Anomaly detection application running on a Kubernetes cluster and generating and processing massive amount of events. Anomaly detection is a method used to detect unusual events in an event stream. It is widely used in a range of applications such as financial fraud detection, security, threat detection, website user analytics, sensors, IoT, system health monitoring, etc. When such applications operate at massive scale generating millions or billions of events, they impose significant computational, performance and scalability challenges to anomaly detection algorithms and data layer technologies. Paul will demonstrate the scalability, performance and cost effectiveness of Apache Kafka, Cassandra and Kubernetes, with results from his experiments allowing the Anomaly detection application to scale to 19 Billion anomaly checks per day. Melbourne Big Data Meetup, March 5 2020 https://www.eventbrite.com/e/melbourne-big-data-meetup-realtime-anomaly-detection-with-cassandra-kafka-tickets-93028445585

Scaling an ELK stack at bol.com

Renzo Tomà

A presentation about the deployment of an ELK stack at bol.com At bol.com we use Elasticsearch, Logstash and Kibana in a logsearch system that allows our developers and operations people to easilly access and search thru logevents coming from all layers of its infrastructure. The presentations explains the initial design and its failures. It continues with explaining the latest design (mid 2014). Its improvements. And finally a set of tips are giving regarding Logstash and Elasticsearch scaling. These slides were first presented at the Elasticsearch NL meetup on September 22nd 2014 at the Utrecht bol.com HQ.

Zero ETL analytics with LLAP in Azure HDInsight

Ashish Thapliyal

This document discusses HDInsight interactive query architecture and performance. It summarizes that: 1. HDInsight uses LLAP (Low Latency Analytical Processing) clusters to serve queries directly from Azure blob storage and data lake store for fast performance on text data. 2. Testing showed LLAP had high query concurrency and interactive query speed compared to Spark SQL and Presto. 3. The document also outlines HDInsight's logging architecture where the OMS agent collects logs and metrics from HDInsight clusters and sends them to Log Analytics for analysis.

Scaling HDFS for Exabyte Storage@twitter

lohitvijayarenu

This document summarizes lessons learned from scaling HDFS storage at Twitter to over 1 exabyte across tens of thousands of nodes. Some key challenges discussed include identifying scale limits through benchmarking, abstracting access across multiple clusters and datacenters, implementing extensive metrics and auditing, preventing single points of failure, handling failures and slowdowns silently, understanding network bottlenecks, implementing throttling, preventing data loss, carefully planning upgrades, and monitoring all aspects of the system. The lessons have helped Twitter scale HDFS and are also useful for scaling other systems.

Fact based monitoring

Datadog

This document discusses moving from host-centric monitoring to fact-based monitoring using Puppet facts. It argues that hosts should not be the center of the monitoring universe, but rather facts should be. Effective monitoring uses queries against existing facts and metrics to express conditions like ensuring web servers respond quickly or PostgreSQL processes are running. This mirrors how Puppet, SQL, and MCollective improved systems management by moving from imperative programming to declarative queries based on available facts and metadata.

Fact-Based Monitoring

Datadog

Your configuration management is fact-based. Your orchestration is fact-based. Is your monitoring fact-based? What does that even mean? Monitoring is very similar to configuration, at least in its expression. Configuration cares about files, services, and hosts being present and in a certain state (""nginx should be running with the following configuration""). Monitoring cares about services being present, running, and in a certain state. Both describe your infrastructure as it should be (""nginx should be running and respond in less than 200ms""). Fact-based monitoring is about being able to control monitoring with the same facts that Puppet uses (""monitor nginx latency wherever Puppet says it should run""). This is in contrast with imperative monitoring (""monitor nginx on host a, b and c"") that gets out of sync and leads to mailbox meltdowns from spurious alerts. Using open source and commercial examples, this talk will help you express your monitoring in a way that will feel very natural to your Puppet configuration.

Introduction to Single-cell RNA-seq

Timothy Tickle

This document provides an introduction to single-cell RNA-seq (scRNA-seq) analysis. It discusses different scRNA-seq assays such as Smart-Seq2, Drop-seq, and 10X, and how their protocols and sequencing outputs differ. It also covers scRNA-seq data characteristics like zero inflation and overdispersion. The document outlines common analysis steps like filtering, dimensionality reduction, clustering, and differential expression. It emphasizes that scRNA-seq data requires specialized analysis due to its noisy and sparse nature compared to bulk RNA-seq data.

Diagnosing Problems in Production - Cassandra

Jon Haddad

1) The document discusses various tools for diagnosing problems in Cassandra production environments, including OpsCenter for monitoring, application metrics collection with Statsd/Graphite, and log aggregation with Splunk or Logstash. 2) Some common issues covered are incorrect server times causing data inconsistencies, tombstone overhead slowing queries, not using the proper snitch, and disk space not being reclaimed on new nodes. 3) Diagnostic tools described are htop, iostat, vmstat, dstat, strace, tcpdump, and nodetool for investigating process activity, disk usage, memory, networking, and Cassandra-specific statistics. GC profiling and query tracing are also recommended.

Diagnosing Problems in Production (Nov 2015)

Jon Haddad

Diagnosing Problems in Production involves first preparing monitoring tools like OpsCenter, server monitoring, application metrics, and log aggregation. Common issues include incorrect server times causing data inconsistencies, tombstone overhead slowing queries, not using the proper snitch, and version mismatches breaking functionality. Diagnostic tools like htop, iostat, vmstat, dstat, strace, jstack, nodetool, histograms, and query tracing help narrow down performance problems which could be due to compaction, garbage collection, or other bottlenecks.

Similar to Fake It 'Til You Make It (20)

Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab

How does the Cloud Foundry Diego Project Run at Scale?

How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...

Big data from the LHC commissioning: practical lessons from big science - Sim...

Real time analytics using Hadoop and Elasticsearch

London devops logging

Performance

Алексей Петров "PHP at Scale: Knowing enough to be dangerous!"

MSR 2009

2013 py con awesome big data algorithms

Flink Forward Berlin 2018: Lasse Nedergaard - "Our successful journey with Fl...

Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...

Scaling an ELK stack at bol.com

Zero ETL analytics with LLAP in Azure HDInsight

Scaling HDFS for Exabyte Storage@twitter

Fact based monitoring

Fact-Based Monitoring

Introduction to Single-cell RNA-seq

Diagnosing Problems in Production - Cassandra

Diagnosing Problems in Production (Nov 2015)

Recently uploaded

Large Language Model (LLM) and it’s Geospatial Applications

Rohit Gautam

Pushing the limits of ePRTC: 100ns holdover for 100 days

Adtran

TrustArc Webinar - 2024 Global Privacy Survey

TrustArc

How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024? In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores. See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe. This webinar will review: - The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey - The top challenges for privacy leaders, practitioners, and organizations in 2024 - Key themes to consider in developing and maintaining your privacy program

“I’m still / I’m still / Chaining from the Block”

Claudio Di Ciccio

Introduction to CHERI technology - Cybersecurity

mikeeftimakis1

20 Comprehensive Checklist of Designing and Developing a Website

Pixlogix Infotech

Dive into the world of Website Designing and Developing with Pixlogix! Looking to create a stunning online presence? Look no further! Our comprehensive checklist covers everything you need to know to craft a website that stands out. From user-friendly design to seamless functionality, we've got you covered. Don't miss out on this invaluable resource! Check out our checklist now at Pixlogix and start your journey towards a captivating online presence today.

20240605 QFM017 Machine Intelligence Reading List May 2024

Matthew Sinclair

Presentation of the OECD Artificial Intelligence Review of Germany

innovationoecd

Essentials of Automations: The Art of Triggers and Actions in FME

Safe Software

In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation. We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios. Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!

Microsoft - Power Platform_G.Aspiotis.pdf

Uni Systems S.M.S.A.

UiPath Test Automation using UiPath Test Suite series, part 5

DianaGray10

Generative AI Deep Dive: Advancing from Proof of Concept to Production

Aggregage

Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf

Malak Abu Hammad

Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers: * What is Vector Search? * Importance and benefits of vector search * Practical use cases across various industries * Step-by-step implementation guide * Live demos with code snippets * Enhancing LLM capabilities with vector search * Best practices and optimization strategies Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications. #MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology

Cosa hanno in comune un mattoncino Lego e la backdoor XZ?

Speck&Tech

ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune. Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile. BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).

Encryption in Microsoft 365 - ExpertsLive Netherlands 2024

Albert Hoitingh

Monitoring Java Application Security with JDK Tools and JFR Events

Ana-Maria Mihalceanu

Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!

SOFTTECHHUB

As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.

20240607 QFM018 Elixir Reading List May 2024

Matthew Sinclair

How to Get CNIC Information System with Paksim Ga.pptx

danishmna97

Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...

Zilliz

Recently uploaded (20)

Large Language Model (LLM) and it’s Geospatial Applications

Pushing the limits of ePRTC: 100ns holdover for 100 days

TrustArc Webinar - 2024 Global Privacy Survey

“I’m still / I’m still / Chaining from the Block”

Introduction to CHERI technology - Cybersecurity

20 Comprehensive Checklist of Designing and Developing a Website

20240605 QFM017 Machine Intelligence Reading List May 2024

Presentation of the OECD Artificial Intelligence Review of Germany

Essentials of Automations: The Art of Triggers and Actions in FME

Microsoft - Power Platform_G.Aspiotis.pdf

UiPath Test Automation using UiPath Test Suite series, part 5

Generative AI Deep Dive: Advancing from Proof of Concept to Production

Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf

Cosa hanno in comune un mattoncino Lego e la backdoor XZ?

Encryption in Microsoft 365 - ExpertsLive Netherlands 2024

Monitoring Java Application Security with JDK Tools and JFR Events

Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!

20240607 QFM018 Elixir Reading List May 2024

How to Get CNIC Information System with Paksim Ga.pptx

Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...

Fake It 'Til You Make It

1. Fake It ’Til You Make It A case study in generating more data than your systems John Stanford -- @jxstanford

2. Software Infrastructure Leadership

4. • Structure • 1 controller • 6 compute nodes • Usage • Training classes • R & D • Pipeline • 7 Heka log shippers • 1 Heka log collector • 1 Elasticsearch node shipper shipper shipper shipper shipper shipper shipper collector Elasticsearch The Environment

6. Every good recipe starts with onions

7. • 21 – 31K per hour • An obvious pattern • A dead spot • A couple spikes 7 Days

8. By Server • 96% from one node • 1% per remaining node • A couple not reporting

9. By Component • 60% • 25% • 15% • 14% • 1%

10. 7 Day Message Rate Recurring outlier!

11.

12. Message Size

13.

14. • Roughly 25K messages/hour • Controllers are 100x noisier than compute nodes • Swift generates 60% of traffic • 99.9% of the time, there were less than 65 messages/sec • There are some traffic spikes of ~1600 messages/sec • 95% of messages were less than 240 bytes • Largest message was 782 bytes

15. The cookbook says…

16. Model parameters • Flow rate • Hosts • Components • Size • Content • ?

17. The Scientist and Engineer's Guide to Digital Signal Processing - Steven W. Smith, Ph.D.

18. https://goo.gl/Uqa2mf

19. When you don’t put your toque on…

20.

21. shipper shipper shipper shipper shipper shipper shipper collector Elasticsearch

22. [flood_1ms] ip_address = “127.0.0.1:9997” sender = “tcp” pprof_file = “” encoder = “protobuf” num_messages = 0 message_interval = “1ms” max_message_size = 2800 variable_message_size = true late_bind_timestamp = true • Add flood process • Monitor everything • Repeat until it breaks The plan

23. heka-flood Seems like a good idea • The good: • Control over message rate • The not so good: • Not enough control over message content • Timestamps assigned at initialization (pull request forthcoming) • Couldn’t get variable message sizing to work • Here comes the meatloaf…

24. The result

25. Tactical • System sustained 4000 x ~1k msgs/sec • Collector and/or ES started to pause above that • No messages were dropped Strategic • Load tool was underqualified • Monitoring tool resolution is important for interpretation • I should prepare for presentations further in advance What did we learn?

26. Intuitions • 4MB/sec is: • 17K messages of 240 bytes • 240 bytes is 95th percentile • 64 msgs/sec is: • 99.9th percentile • 265 controllers • 1060 compute • 1:100 ratio Our First Model

27. • Find the bottlenecks • System resources? • Elasticsearch? • Collector? • Load generator? • Improve the model • Probability distributions • Noise • Real message samples • Off host load • Regime changes • Real world feedback Next Steps

28. Thanks, questions?

Editor's Notes

Lately, most of my time has been spent with these tools, and generally shifting to the right.
We recently started using Heka instead of Logstash. It’s an experiment, but we’re embracing Go in some other development contexts, and we have not Ruby skills in house. Heka is interesting, but still a young project. I have high hopes for it from a performance perspective, and have had good interactions with the community for both Q & A and code acceptance.
When sales or consultants come to me and ask how many nodes our product can support, I’d like to provide a better answer than the shrug emoji, so we’re going to walk through a modeling exercise.
Let’s start with the big picture.
Since the missing servers are compute nodes, they are most likely have a very similar profile as the other compute nodes.
Query all the logs for 7 days, and do a date histogram on the timestamp with an interval of 1 second. With our amount of data, the aggregation took about 15 seconds. Your situation might be different, so you can adjust the lookback or the interval to suit your needs. We see that 99.9% of the time, there were no more than 64 messages per second, but there are a few outliers. I didn’t include the chart, but I think there were about 10 instances of a rate over 1000 messages/sec. We’ll want to consider those in a robust model.
But for now, let’s remove them so we can see the shape of the message rate histogram. I’ve hand drawn in the red line to highlight what I think is a reasonable shape.
Let’s do a similar analysis of the message size. Here’ well look at the last 7 days, but do a percentiles aggregation. We can see that our max message size was 782 bytes, but that 95% of messages were less than 240 byes. This is only the text component of the message, not the other fields like tags, timestamp, etc., but those things are pretty constant in this environment, so it should be ok.
So, what did we learn out our environment?
Now, we’ve got a bunch of ingredients, and we want to get down to the business of making a model.
First, we might try to select a distribution that looks most like our model parameters. That’s pretty straight forward, and you can use tools like numpy and scipy to generate these curves, then take samples from the line to get a value for a parameter.
You might want to add some noise to your parameters to give them some “texture”. For example, you might want to vary the cadence of the message stream by making a decision at each sample time about how many messages to generate, or increase or decrease the message size.
And then all you need to do is execute flawlessly.
But what happens when you left your not a professional data modeler, or your knives aren’t sharp, or people are pounding the forks and knives on the table… You probably end up with something like...
Meatloaf. Arguably not horrible, but no Thomas Keller dish either.
Let’s revisit what what we have to work with. Shippers, collectors, and a datastore. The shippers and collectors are distributed. In our case, the datastore is on the same node as the collector. It’s unlikely that an individual shipper will generate more data than the network, or an individual collector can handle, so we’ll focus on the flooding the collector from the local host, and see how the collector / elasticsearch combo handle things. There’s a chance that we can overwork the OS, but this still seems like a good starting point to isolate bottlenecks.
Heka has a load testing utility called flood. I tend to build a mental model of what these types of utilities should do, then by the time I get a chance to prototype a solution using them, I can only hope they they are close to my mental model. In this case, I missed the mark a little, and hit a few other snags, but overall, I was able to demonstrate what I had hoped to.
A simplified overview of what I found was that everything ramped up pretty well until I got over 5500 messages/sec, then things started to destabilize. The shippers started pausing for a few seconds at a time, then would restart, run for a while and pause again. I included two views of this behavior to show how the sample rate can mask or highlight system behavior. Marvel (on top) is aggregating over 10 seconds, Kibana (bottom) is aggregating over 1 second. You can see the destabilization in Marvel, but it’s much more obvious in Kibana.
…. But on to our first model...
… So, when someone asks if we can handle traffic from 100 controllers and 15000 compute nodes, I can say “The model says yes!”, but when they ask if we can handle 250 controllers and 10000 compute nodes, I can say “The model says “Oh, no. That would be a bad idea...”
Well, we didn’t really find the bottleneck, which means we didn’t get the chance to tune it or scale it. We also have a very simple model using fairly simple tools. We could definitely improve that. Finally, we would want to take some real world samples to support or refute our model. In other words, we should continuously improvement via understanding our environment, tuning parameters, and validating our model with feedback from the real world.
Since I paid for the right to use Meatloaf, he would like to say Thanks, and ask if you have any questions…

Fake It 'Til You Make It

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (15)

Similar to Fake It 'Til You Make It

Similar to Fake It 'Til You Make It (20)

Recently uploaded

Recently uploaded (20)

Fake It 'Til You Make It

Editor's Notes