SlideShare a Scribd company logo
Fake It ’Til You Make It
A case study in generating more data than your systems
John Stanford -- @jxstanford
Software Infrastructure Leadership
• Structure
• 1 controller
• 6 compute nodes
• Usage
• Training classes
• R & D
• Pipeline
• 7 Heka log shippers
• 1 Heka log collector
• 1 Elasticsearch node
shipper
shipper
shipper
shipper
shipper
shipper
shipper
collector Elasticsearch
The Environment
shipper
shipper
shipper
shipper
shipper
shipper
shipper
collector Elasticsearch
shipper
shipper
shipper
shipper
shipper
shipper
shipper
collector Elasticsearch
Elasticsearch
Elasticsearch
Elasticsearch
Elasticsearch
Elasticsearch
Elasticsearch
collector
collector
collector
collector
collector
collector
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
shipper
¯_(ツ)_
/¯
Every good recipe starts with onions
• 21 – 31K per hour
• An obvious pattern
• A dead spot
• A couple spikes
7 Days
By Server
• 96% from one node
• 1% per remaining node
• A couple not reporting
By Component
• 60%
• 25%
• 15%
• 14%
• 1%
7 Day Message Rate
Recurring outlier!
Message Size
• Roughly 25K messages/hour
• Controllers are 100x noisier than compute nodes
• Swift generates 60% of traffic
• 99.9% of the time, there were less than 65 messages/sec
• There are some traffic spikes of ~1600 messages/sec
• 95% of messages were less than 240 bytes
• Largest message was 782 bytes
The cookbook says…
Model parameters
• Flow rate
• Hosts
• Components
• Size
• Content
• ?
The Scientist and Engineer's Guide to Digital Signal Processing - Steven W. Smith, Ph.D.
https://goo.gl/Uqa2mf
When you don’t put your toque on…
shipper
shipper
shipper
shipper
shipper
shipper
shipper
collector Elasticsearch
[flood_1ms]
ip_address = “127.0.0.1:9997”
sender = “tcp”
pprof_file = “”
encoder = “protobuf”
num_messages = 0
message_interval = “1ms”
max_message_size = 2800
variable_message_size = true
late_bind_timestamp = true
• Add flood process
• Monitor everything
• Repeat until it breaks
The plan
heka-flood
Seems like a good idea
• The good:
• Control over message rate
• The not so good:
• Not enough control over message content
• Timestamps assigned at initialization (pull request forthcoming)
• Couldn’t get variable message sizing to work
• Here comes the meatloaf…
The result
Tactical
• System sustained 4000 x ~1k msgs/sec
• Collector and/or ES started to pause above that
• No messages were dropped
Strategic
• Load tool was underqualified
• Monitoring tool resolution is important for interpretation
• I should prepare for presentations further in advance
What did we learn?
Intuitions
• 4MB/sec is:
• 17K messages of 240 bytes
• 240 bytes is 95th percentile
• 64 msgs/sec is:
• 99.9th percentile
• 265 controllers
• 1060 compute
• 1:100 ratio
Our First Model
• Find the bottlenecks
• System resources?
• Elasticsearch?
• Collector?
• Load generator?
• Improve the model
• Probability distributions
• Noise
• Real message samples
• Off host load
• Regime changes
• Real world feedback
Next Steps
Thanks,
questions?

More Related Content

What's hot

Scaling Elasticsearch at Synthesio
Scaling Elasticsearch at SynthesioScaling Elasticsearch at Synthesio
Scaling Elasticsearch at Synthesio
Fred de Villamil
 
Prezo at-mesos con2015-final
Prezo at-mesos con2015-finalPrezo at-mesos con2015-final
Prezo at-mesos con2015-final
Sharma Podila
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQ
Xin Wang
 
Determinism in finance
Determinism in financeDeterminism in finance
Determinism in finance
Peter Lawrey
 
Stabilising the jenga tower
Stabilising the jenga towerStabilising the jenga tower
Stabilising the jenga tower
Gordon Chung
 
Securing Sharded Networks with Swarm
Securing Sharded Networks with SwarmSecuring Sharded Networks with Swarm
Securing Sharded Networks with Swarm
Fluence Labs
 
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without DowntimeMigrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
Fred de Villamil
 
Network Monitoring with Icinga
Network Monitoring with IcingaNetwork Monitoring with Icinga
Network Monitoring with Icinga
learjk
 
Low level java programming
Low level java programmingLow level java programming
Low level java programming
Peter Lawrey
 
Introduction to Storm
Introduction to StormIntroduction to Storm
Introduction to Storm
Eugene Dvorkin
 
Using Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems EasyUsing Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems Easy
nathanmarz
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.
DECK36
 
Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using Storm
Nati Shalom
 
Activation limit&flowlimit&maxjobs
Activation limit&flowlimit&maxjobsActivation limit&flowlimit&maxjobs
Activation limit&flowlimit&maxjobs
Nirmala Devi
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
DataStax Academy
 
Advanced off heap ipc
Advanced off heap ipcAdvanced off heap ipc
Advanced off heap ipc
Peter Lawrey
 
Redis
RedisRedis
Redis
George Li
 
Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache Storm
Md. Shamsur Rahim
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
Eugene Dvorkin
 
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
DataStax
 

What's hot (20)

Scaling Elasticsearch at Synthesio
Scaling Elasticsearch at SynthesioScaling Elasticsearch at Synthesio
Scaling Elasticsearch at Synthesio
 
Prezo at-mesos con2015-final
Prezo at-mesos con2015-finalPrezo at-mesos con2015-final
Prezo at-mesos con2015-final
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQ
 
Determinism in finance
Determinism in financeDeterminism in finance
Determinism in finance
 
Stabilising the jenga tower
Stabilising the jenga towerStabilising the jenga tower
Stabilising the jenga tower
 
Securing Sharded Networks with Swarm
Securing Sharded Networks with SwarmSecuring Sharded Networks with Swarm
Securing Sharded Networks with Swarm
 
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without DowntimeMigrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
 
Network Monitoring with Icinga
Network Monitoring with IcingaNetwork Monitoring with Icinga
Network Monitoring with Icinga
 
Low level java programming
Low level java programmingLow level java programming
Low level java programming
 
Introduction to Storm
Introduction to StormIntroduction to Storm
Introduction to Storm
 
Using Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems EasyUsing Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems Easy
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.
 
Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using Storm
 
Activation limit&flowlimit&maxjobs
Activation limit&flowlimit&maxjobsActivation limit&flowlimit&maxjobs
Activation limit&flowlimit&maxjobs
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
 
Advanced off heap ipc
Advanced off heap ipcAdvanced off heap ipc
Advanced off heap ipc
 
Redis
RedisRedis
Redis
 
Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache Storm
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
 

Viewers also liked

Elastic search
Elastic searchElastic search
Elastic search
Torstein Hansen
 
Five Lines I Could Not Draw
Five Lines I Could Not DrawFive Lines I Could Not Draw
Five Lines I Could Not Draw
Librato, Inc.
 
Scaling Pinterest's Monitoring
Scaling Pinterest's MonitoringScaling Pinterest's Monitoring
Scaling Pinterest's Monitoring
Brian Overstreet
 
Monitoring, graphs and visualisations
Monitoring, graphs and visualisationsMonitoring, graphs and visualisations
Monitoring, graphs and visualisations
morekid
 
Statistics for Engineers
Statistics for EngineersStatistics for Engineers
Statistics for Engineers
Heinrich Hartmann
 
Monitorama: How monitoring can improve the rest of the company
Monitorama: How monitoring can improve the rest of the companyMonitorama: How monitoring can improve the rest of the company
Monitorama: How monitoring can improve the rest of the company
Jeff Weinstein
 
Monitorama 2016
Monitorama 2016Monitorama 2016
Monitorama 2016
Sarah Hagan
 
Sessionization with Spark streaming
Sessionization with Spark streamingSessionization with Spark streaming
Sessionization with Spark streaming
Ramūnas Urbonas
 
Everything obfuscurity taught me about monitoring
Everything obfuscurity taught me about monitoringEverything obfuscurity taught me about monitoring
Everything obfuscurity taught me about monitoring
Pete Cheslock
 
Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2
ice799
 
2016 metrics-as-culture
2016 metrics-as-culture2016 metrics-as-culture
2016 metrics-as-culture
Nicole Forsgren
 
Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...
Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...
Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...
Adrian Cockcroft
 
Google Cloud Platform: Prototype ->Production-> Planet scale
Google Cloud Platform: Prototype ->Production-> Planet scaleGoogle Cloud Platform: Prototype ->Production-> Planet scale
Google Cloud Platform: Prototype ->Production-> Planet scale
Idan Tohami
 
All of Your Network Monitoring is (probably) Wrong
All of Your Network Monitoring is (probably) WrongAll of Your Network Monitoring is (probably) Wrong
All of Your Network Monitoring is (probably) Wrong
ice799
 
Monitoring Challenges - Monitorama 2016 - Monitoringless
Monitoring Challenges - Monitorama 2016 - MonitoringlessMonitoring Challenges - Monitorama 2016 - Monitoringless
Monitoring Challenges - Monitorama 2016 - Monitoringless
Adrian Cockcroft
 

Viewers also liked (15)

Elastic search
Elastic searchElastic search
Elastic search
 
Five Lines I Could Not Draw
Five Lines I Could Not DrawFive Lines I Could Not Draw
Five Lines I Could Not Draw
 
Scaling Pinterest's Monitoring
Scaling Pinterest's MonitoringScaling Pinterest's Monitoring
Scaling Pinterest's Monitoring
 
Monitoring, graphs and visualisations
Monitoring, graphs and visualisationsMonitoring, graphs and visualisations
Monitoring, graphs and visualisations
 
Statistics for Engineers
Statistics for EngineersStatistics for Engineers
Statistics for Engineers
 
Monitorama: How monitoring can improve the rest of the company
Monitorama: How monitoring can improve the rest of the companyMonitorama: How monitoring can improve the rest of the company
Monitorama: How monitoring can improve the rest of the company
 
Monitorama 2016
Monitorama 2016Monitorama 2016
Monitorama 2016
 
Sessionization with Spark streaming
Sessionization with Spark streamingSessionization with Spark streaming
Sessionization with Spark streaming
 
Everything obfuscurity taught me about monitoring
Everything obfuscurity taught me about monitoringEverything obfuscurity taught me about monitoring
Everything obfuscurity taught me about monitoring
 
Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2
 
2016 metrics-as-culture
2016 metrics-as-culture2016 metrics-as-culture
2016 metrics-as-culture
 
Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...
Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...
Monitorama - Please, no more Minutes, Milliseconds, Monoliths or Monitoring T...
 
Google Cloud Platform: Prototype ->Production-> Planet scale
Google Cloud Platform: Prototype ->Production-> Planet scaleGoogle Cloud Platform: Prototype ->Production-> Planet scale
Google Cloud Platform: Prototype ->Production-> Planet scale
 
All of Your Network Monitoring is (probably) Wrong
All of Your Network Monitoring is (probably) WrongAll of Your Network Monitoring is (probably) Wrong
All of Your Network Monitoring is (probably) Wrong
 
Monitoring Challenges - Monitorama 2016 - Monitoringless
Monitoring Challenges - Monitorama 2016 - MonitoringlessMonitoring Challenges - Monitorama 2016 - Monitoringless
Monitoring Challenges - Monitorama 2016 - Monitoringless
 

Similar to Fake It 'Til You Make It

Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
How does the Cloud Foundry Diego Project Run at Scale?
How does the Cloud Foundry Diego Project Run at Scale?How does the Cloud Foundry Diego Project Run at Scale?
How does the Cloud Foundry Diego Project Run at Scale?
VMware Tanzu
 
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
Amit Gupta
 
Big data from the LHC commissioning: practical lessons from big science - Sim...
Big data from the LHC commissioning: practical lessons from big science - Sim...Big data from the LHC commissioning: practical lessons from big science - Sim...
Big data from the LHC commissioning: practical lessons from big science - Sim...
jaxLondonConference
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and Elasticsearch
Abhishek Andhavarapu
 
London devops logging
London devops loggingLondon devops logging
London devops logging
Tomas Doran
 
Performance
PerformancePerformance
Performance
Christophe Marchal
 
Алексей Петров "PHP at Scale: Knowing enough to be dangerous!"
Алексей Петров "PHP at Scale: Knowing enough to be dangerous!"Алексей Петров "PHP at Scale: Knowing enough to be dangerous!"
Алексей Петров "PHP at Scale: Knowing enough to be dangerous!"
Fwdays
 
MSR 2009
MSR 2009MSR 2009
MSR 2009
swy351
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
c.titus.brown
 
Flink Forward Berlin 2018: Lasse Nedergaard - "Our successful journey with Fl...
Flink Forward Berlin 2018: Lasse Nedergaard - "Our successful journey with Fl...Flink Forward Berlin 2018: Lasse Nedergaard - "Our successful journey with Fl...
Flink Forward Berlin 2018: Lasse Nedergaard - "Our successful journey with Fl...
Flink Forward
 
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Paul Brebner
 
Scaling an ELK stack at bol.com
Scaling an ELK stack at bol.comScaling an ELK stack at bol.com
Scaling an ELK stack at bol.com
Renzo Tomà
 
Zero ETL analytics with LLAP in Azure HDInsight
Zero ETL analytics with LLAP in Azure HDInsightZero ETL analytics with LLAP in Azure HDInsight
Zero ETL analytics with LLAP in Azure HDInsight
Ashish Thapliyal
 
Scaling HDFS for Exabyte Storage@twitter
Scaling HDFS for Exabyte Storage@twitterScaling HDFS for Exabyte Storage@twitter
Scaling HDFS for Exabyte Storage@twitter
lohitvijayarenu
 
Fact based monitoring
Fact based monitoringFact based monitoring
Fact based monitoring
Datadog
 
Fact-Based Monitoring
Fact-Based MonitoringFact-Based Monitoring
Fact-Based Monitoring
Datadog
 
Introduction to Single-cell RNA-seq
Introduction to Single-cell RNA-seqIntroduction to Single-cell RNA-seq
Introduction to Single-cell RNA-seq
Timothy Tickle
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - Cassandra
Jon Haddad
 
Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)
Jon Haddad
 

Similar to Fake It 'Til You Make It (20)

Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab
 
How does the Cloud Foundry Diego Project Run at Scale?
How does the Cloud Foundry Diego Project Run at Scale?How does the Cloud Foundry Diego Project Run at Scale?
How does the Cloud Foundry Diego Project Run at Scale?
 
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
How does the Cloud Foundry Diego Project Run at Scale, and Updates on .NET Su...
 
Big data from the LHC commissioning: practical lessons from big science - Sim...
Big data from the LHC commissioning: practical lessons from big science - Sim...Big data from the LHC commissioning: practical lessons from big science - Sim...
Big data from the LHC commissioning: practical lessons from big science - Sim...
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and Elasticsearch
 
London devops logging
London devops loggingLondon devops logging
London devops logging
 
Performance
PerformancePerformance
Performance
 
Алексей Петров "PHP at Scale: Knowing enough to be dangerous!"
Алексей Петров "PHP at Scale: Knowing enough to be dangerous!"Алексей Петров "PHP at Scale: Knowing enough to be dangerous!"
Алексей Петров "PHP at Scale: Knowing enough to be dangerous!"
 
MSR 2009
MSR 2009MSR 2009
MSR 2009
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
 
Flink Forward Berlin 2018: Lasse Nedergaard - "Our successful journey with Fl...
Flink Forward Berlin 2018: Lasse Nedergaard - "Our successful journey with Fl...Flink Forward Berlin 2018: Lasse Nedergaard - "Our successful journey with Fl...
Flink Forward Berlin 2018: Lasse Nedergaard - "Our successful journey with Fl...
 
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
 
Scaling an ELK stack at bol.com
Scaling an ELK stack at bol.comScaling an ELK stack at bol.com
Scaling an ELK stack at bol.com
 
Zero ETL analytics with LLAP in Azure HDInsight
Zero ETL analytics with LLAP in Azure HDInsightZero ETL analytics with LLAP in Azure HDInsight
Zero ETL analytics with LLAP in Azure HDInsight
 
Scaling HDFS for Exabyte Storage@twitter
Scaling HDFS for Exabyte Storage@twitterScaling HDFS for Exabyte Storage@twitter
Scaling HDFS for Exabyte Storage@twitter
 
Fact based monitoring
Fact based monitoringFact based monitoring
Fact based monitoring
 
Fact-Based Monitoring
Fact-Based MonitoringFact-Based Monitoring
Fact-Based Monitoring
 
Introduction to Single-cell RNA-seq
Introduction to Single-cell RNA-seqIntroduction to Single-cell RNA-seq
Introduction to Single-cell RNA-seq
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - Cassandra
 
Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)
 

Recently uploaded

Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Zilliz
 

Recently uploaded (20)

Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
 

Fake It 'Til You Make It

Editor's Notes

  1. Lately, most of my time has been spent with these tools, and generally shifting to the right.
  2. We recently started using Heka instead of Logstash. It’s an experiment, but we’re embracing Go in some other development contexts, and we have not Ruby skills in house. Heka is interesting, but still a young project. I have high hopes for it from a performance perspective, and have had good interactions with the community for both Q & A and code acceptance.
  3. When sales or consultants come to me and ask how many nodes our product can support, I’d like to provide a better answer than the shrug emoji, so we’re going to walk through a modeling exercise.
  4. Let’s start with the big picture.
  5. Since the missing servers are compute nodes, they are most likely have a very similar profile as the other compute nodes.
  6. Query all the logs for 7 days, and do a date histogram on the timestamp with an interval of 1 second. With our amount of data, the aggregation took about 15 seconds. Your situation might be different, so you can adjust the lookback or the interval to suit your needs. We see that 99.9% of the time, there were no more than 64 messages per second, but there are a few outliers. I didn’t include the chart, but I think there were about 10 instances of a rate over 1000 messages/sec. We’ll want to consider those in a robust model.
  7. But for now, let’s remove them so we can see the shape of the message rate histogram. I’ve hand drawn in the red line to highlight what I think is a reasonable shape.
  8. Let’s do a similar analysis of the message size. Here’ well look at the last 7 days, but do a percentiles aggregation. We can see that our max message size was 782 bytes, but that 95% of messages were less than 240 byes. This is only the text component of the message, not the other fields like tags, timestamp, etc., but those things are pretty constant in this environment, so it should be ok.
  9. So, what did we learn out our environment?
  10. Now, we’ve got a bunch of ingredients, and we want to get down to the business of making a model.
  11. First, we might try to select a distribution that looks most like our model parameters. That’s pretty straight forward, and you can use tools like numpy and scipy to generate these curves, then take samples from the line to get a value for a parameter.
  12. You might want to add some noise to your parameters to give them some “texture”. For example, you might want to vary the cadence of the message stream by making a decision at each sample time about how many messages to generate, or increase or decrease the message size.
  13. And then all you need to do is execute flawlessly.
  14. But what happens when you left your not a professional data modeler, or your knives aren’t sharp, or people are pounding the forks and knives on the table… You probably end up with something like...
  15. Meatloaf. Arguably not horrible, but no Thomas Keller dish either.
  16. Let’s revisit what what we have to work with. Shippers, collectors, and a datastore. The shippers and collectors are distributed. In our case, the datastore is on the same node as the collector. It’s unlikely that an individual shipper will generate more data than the network, or an individual collector can handle, so we’ll focus on the flooding the collector from the local host, and see how the collector / elasticsearch combo handle things. There’s a chance that we can overwork the OS, but this still seems like a good starting point to isolate bottlenecks.
  17. Heka has a load testing utility called flood. I tend to build a mental model of what these types of utilities should do, then by the time I get a chance to prototype a solution using them, I can only hope they they are close to my mental model. In this case, I missed the mark a little, and hit a few other snags, but overall, I was able to demonstrate what I had hoped to.
  18. A simplified overview of what I found was that everything ramped up pretty well until I got over 5500 messages/sec, then things started to destabilize. The shippers started pausing for a few seconds at a time, then would restart, run for a while and pause again. I included two views of this behavior to show how the sample rate can mask or highlight system behavior. Marvel (on top) is aggregating over 10 seconds, Kibana (bottom) is aggregating over 1 second. You can see the destabilization in Marvel, but it’s much more obvious in Kibana.
  19. …. But on to our first model...
  20. … So, when someone asks if we can handle traffic from 100 controllers and 15000 compute nodes, I can say “The model says yes!”, but when they ask if we can handle 250 controllers and 10000 compute nodes, I can say “The model says “Oh, no. That would be a bad idea...”
  21. Well, we didn’t really find the bottleneck, which means we didn’t get the chance to tune it or scale it. We also have a very simple model using fairly simple tools. We could definitely improve that. Finally, we would want to take some real world samples to support or refute our model. In other words, we should continuously improvement via understanding our environment, tuning parameters, and validating our model with feedback from the real world.
  22. Since I paid for the right to use Meatloaf, he would like to say Thanks, and ask if you have any questions…