Introduction to
Capacity Planning
Brian Brazil
Founder
Who am I?
Engineer passionate about running software reliably in production.
Founder of Robust Perception, promoting operational efficiency through
Prometheus
Google SRE for 7 years, working on high-scale reliable systems such as Billing,
Adwords, Adsense, Ad Exchange, Database
Boxever TL Systems & Infrastructure, applied processes and technology to let the
company scale and reduce operational load
Contributor to many open source projects, including Prometheus, Ansible,
Python, Aurora and Zookeeper.
What’s this talk about?
At the end of the talk you will be able to:
Estimate how much spare capacity you have in less than 5 minutes
Estimate how much runway that capacity provides
Determine how many machines you need
Spot common potential problems as you scale
For a simple system this should set you up for your first 1-2 years, if not more
Audience
This talk is looking at the basics.
I assume that you:
Use Unix in production
Have a relatively simple setup
Don’t have a team doing this for you already
I’m also going to focus on webservices-type systems that mobile devices would
talk to, rather than offline processing or batch.
Capacity
Estimate your capacity in 3 easy steps!
1. Measure bottleneck resource at peak traffic
2. Divide to get fraction of limit
3. Divide into peak traffic
Estimate your capacity in 3 not so easy steps!
1. What’s your bottleneck? How do you measure it?
2. What’s your bottleneck’s limit?
3. What’s your peak traffic?
Step 1: What’s the bottleneck?
The most common bottlenecks:
1. CPU
2. Disk I/O
Less common: network, disk space, external resources, quotas, hardcoded limits,
contention/locking, memory, file descriptors, port numbers, humans
Step 1: Where’s the bottleneck?
Look at CPU % and Disk I/O Utilisation on each type of machine.
If you have monitoring, use that.
Failing that:
sudo apt-get install sysstat
iostat -x 5
Step 1: Iostat
avg-cpu:    %user    %nice  %system  %iowait   %steal    %idle
             4.24     0.00     1.18     0.98     0.00    93.60

Device:    rrqm/s   wrqm/s      r/s      w/s    rkB/s    wkB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sda          0.00     1.40     0.00     3.80     0.00    45.20    23.79     0.00     1.05     0.00     1.05     0.84     0.32
sdb          0.00     1.40     0.00    21.00     0.00   267.20    25.45     0.09     4.11     0.00     4.11     4.11     8.64
sdc          0.00     1.40     0.00    20.00     0.00   267.20    26.72     0.06     3.24     0.00     3.24     3.24     6.48
md0          0.00     0.00     0.00     2.00     0.00     8.00     8.00     0.00     0.00     0.00     0.00     0.00     0.00
The numbers you care about are %idle and %util.
%idle is the share of CPU not in use. %util is the share of disk I/O capacity in
use. Where there are several disks, take the biggest %util.
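Reading those two numbers by hand gets tedious across many machines. A minimal sketch of pulling them out of a saved iostat report, assuming %idle is the last CPU column and %util is the last device column as in the output above. The name parse_iostat and the SAMPLE text are illustrative, not part of sysstat.

```python
# Sample report copied from the slide above.
SAMPLE = """\
avg-cpu: %user %nice %system %iowait %steal %idle
4.24 0.00 1.18 0.98 0.00 93.60
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 1.40 0.00 3.80 0.00 45.20 23.79 0.00 1.05 0.00 1.05 0.84 0.32
sdb 0.00 1.40 0.00 21.00 0.00 267.20 25.45 0.09 4.11 0.00 4.11 4.11 8.64
sdc 0.00 1.40 0.00 20.00 0.00 267.20 26.72 0.06 3.24 0.00 3.24 3.24 6.48
md0 0.00 0.00 0.00 2.00 0.00 8.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
"""


def parse_iostat(report):
    """Return (cpu_used_percent, (busiest_device, util_percent)) for one report."""
    rows = [line.split() for line in report.splitlines() if line.strip()]
    idle = None
    utils = {}
    in_devices = False
    for index, fields in enumerate(rows):
        if fields[0] == "avg-cpu:":
            # The values sit on the line after the header; %idle is the last one.
            idle = float(rows[index + 1][-1])
            in_devices = False
        elif fields[0].startswith("Device"):
            in_devices = True
        elif in_devices:
            # Assumption: %util is the last column of each device row.
            utils[fields[0]] = float(fields[-1])
    if idle is None or not utils:
        raise ValueError("no CPU or device section found")
    busiest = max(utils, key=utils.get)
    return 100.0 - idle, (busiest, utils[busiest])


cpu_used, (device, util) = parse_iostat(SAMPLE)
print(f"CPU used: {cpu_used:.2f}%, busiest disk: {device} at {util:.2f}%")
```

Note that the first report iostat prints covers the time since boot, so use a later report taken at peak.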
Step 2: What’s the limit?
We now know the CPU and disk I/O usage on each machine at peak.
Which is the bottleneck though?
Need to know the limit. Rules of thumb:
80% limit for CPU
50% limit for Disk I/O
Step 2: Division
Find how full each CPU and disk is.
Say we had a disk 10% utilised, and a CPU 20% utilised (80% idle).
0.1/0.5 = 0.2 => Disk IO is at 20% of limit
0.2/0.8 = 0.25 => CPU is at 25% of limit
CPU is our bottleneck, with 25% of capacity used.
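The same division as a minimal sketch. The limits are the rules of thumb above, and the names LIMITS, fraction_of_limit and find_bottleneck are made up for illustration.

```python
# Rule-of-thumb limits from the talk: 80% for CPU, 50% for disk I/O.
LIMITS = {"cpu": 0.80, "disk": 0.50}


def fraction_of_limit(utilisation, limit):
    """Return how much of the safe limit is already used (0.25 means 25%)."""
    return utilisation / limit


def find_bottleneck(utilisation_by_resource, limits=LIMITS):
    """Return (resource, fraction) for the resource closest to its limit."""
    fractions = {
        name: fraction_of_limit(used, limits[name])
        for name, used in utilisation_by_resource.items()
    }
    name = max(fractions, key=fractions.get)
    return name, fractions[name]


# Example from the slide: disk 10% utilised, CPU 20% utilised (80% idle).
resource, fraction = find_bottleneck({"cpu": 0.20, "disk": 0.10})
print(resource, round(fraction, 2))
```

Whichever resource has the highest fraction of its limit is the bottleneck, even if its raw utilisation is not the highest.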
Step 2: Utilisation Visualisation
Step 3: Peak traffic
Now that we know how full our bottleneck is, we need to know how much capacity
we have.
Figure out how much traffic you were handling around the time you measured CPU
and disk utilisation.
You might do this via monitoring, by parsing logs or, if you’re really stuck, with tcpdump.
Step 3: The 2nd division
Let’s say our queries per second (qps) was 10 around peak.
Our CPU was our bottleneck, and about 25% of our limit.
10/0.25 = 40qps
So we can currently handle a maximum traffic of around 40qps
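As a rough sketch, the second division looks like this. It assumes utilisation grows roughly linearly with traffic, which is the assumption behind the slide's arithmetic. The function name estimate_capacity is illustrative.

```python
def estimate_capacity(peak_qps, fraction_of_limit):
    """Estimate maximum safe traffic from peak traffic and the fraction of limit used."""
    if fraction_of_limit <= 0:
        raise ValueError("fraction_of_limit must be positive")
    return peak_qps / fraction_of_limit


# Example from the slide: 10qps at peak, bottleneck at 25% of its limit.
capacity = estimate_capacity(10, 0.25)
spare = capacity - 10
print(f"capacity: {capacity:.0f}qps, spare: {spare:.0f}qps")
```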
Step 3: Capacity Visualisation
Now you can estimate your capacity in 3 easy steps!
1. Measure bottleneck resource at peak traffic
Use monitoring or iostat to see how close you are to the limit, say 20% full
2. Divide to get fraction of limit
With a limit of 80% for CPU, you’re 20/80 = 25% full
3. Divide into peak traffic
Traffic was 10qps, so 10/0.25 = 40qps capacity
Runway
How much runway do you have?
You now have a rough idea of how much capacity you have to spare.
In the example here, we’re using 10qps out of 40qps capacity.
How long will that 30qps last you?
The two main factors are new customers and organic growth.
New Customers
New customers/partners are your main source of traffic.
Look at your traffic graphs around the time a new customer started using your
system.
If the customer had, say, 1M users and you saw peak traffic increase by 10qps, you
can now predict how much traffic future customers will generate.
Based on sales predictions, you can tell how much capacity you’ll need for new
customers.
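A minimal sketch of that prediction, assuming traffic scales linearly with a customer's user count. The 2.5M user prospect is a made-up figure, and the function names are illustrative.

```python
def qps_per_user(observed_qps_increase, customer_users):
    """Peak qps added per user, measured from a customer you already onboarded."""
    return observed_qps_increase / customer_users


def predict_customer_qps(users, rate):
    """Predicted peak qps for a prospective customer with the given user count."""
    return users * rate


# From the slide: a 1M user customer added 10qps at peak.
rate = qps_per_user(10, 1_000_000)
# Hypothetical prospect with 2.5M users.
predicted = predict_customer_qps(2_500_000, rate)
print(f"predicted extra peak traffic: {predicted:.1f}qps")
```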
Organic growth
Over time your existing customers/partners will use the system more and more,
new employees are hired, they get new customers etc.
Look at your monitoring’s traffic graphs over a few months to see what the trend is
like. Do your best to ignore the impact of launches.
Calculate your % growth month on month.
Starting out, it’s likely that organic growth will not be your main consideration.
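One way to get a month-on-month figure is the average compound growth rate between the first and last month. This is a sketch under that assumption. The monthly peak numbers are invented for illustration.

```python
def month_on_month_growth(monthly_peak_qps):
    """Average compound monthly growth rate across a series of monthly peaks."""
    if len(monthly_peak_qps) < 2:
        raise ValueError("need at least two months of data")
    first, last = monthly_peak_qps[0], monthly_peak_qps[-1]
    months = len(monthly_peak_qps) - 1
    return (last / first) ** (1 / months) - 1


# Hypothetical peak qps for three consecutive months, launches excluded.
growth = month_on_month_growth([100, 110, 121])
print(f"organic growth: {growth:.1%} per month")
```

Using only the endpoints keeps it simple, but one unusual month at either end will skew the result, so eyeball the graph as well.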
Calculating runway
Once again in the example here, we’re using 10qps out of 40qps capacity.
Each 1M user customer generates 10qps of additional traffic.
You also expect a negligible amount of organic growth.
This means you can handle 3M more users worth of new customers.
If you’re signing up one 1M user customer per month, that gives you 3 months.
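The runway arithmetic can be sketched as a month-by-month simulation, which also lets organic growth be added later. The function name runway_months and the max_months cap are assumptions for illustration.

```python
def runway_months(capacity_qps, current_qps, qps_per_customer,
                  customers_per_month, organic_growth=0.0, max_months=120):
    """Count the full months of growth that fit before traffic exceeds capacity."""
    traffic = current_qps
    for month in range(max_months):
        # Apply organic growth to existing traffic, then add new customers.
        traffic = traffic * (1 + organic_growth) + qps_per_customer * customers_per_month
        if traffic > capacity_qps:
            return month
    return max_months


# Example from the slide: 10qps used of 40qps, one 10qps customer per month.
print(runway_months(40, 10, 10, 1))
# Same, with a hypothetical 5% monthly organic growth.
print(runway_months(40, 10, 10, 1, organic_growth=0.05))
```

With no organic growth this gives the 3 months from the slide. Even modest organic growth shortens it.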
Provisioning
Provisioning vs Capacity Planning
Capacity Planning:
In 6 months I will have 7 new customers, and need to be able to handle 100qps in
total
Provisioning:
To handle 100qps I need X frontends and Y databases
Provisioning: What can a machine handle?
Continuing our example, let’s say we had 4 machines and each reported being at
CPU 20% (25% of the 80% limit) while dealing with 10qps each.
The key metric is qps per machine.
10qps / 0.2 = 50qps/machine at 100% CPU
Can only safely use 80% of the machine, so 50 * 0.8 = 40qps
So we can handle 40qps per machine.
Provisioning: How many machines do I need?
If we want to handle 100qps, we need 100/40 = 2.5 machines. So 3 machines.
For each type of machine, calculate the incoming external qps it can handle and
how many you need.
Don’t fret about $10/month worth of cost, it’s not worth your time.
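The provisioning arithmetic as a minimal sketch, assuming CPU is the bottleneck and scales linearly with traffic. The function names are illustrative.

```python
import math


def safe_qps_per_machine(qps, cpu_utilisation, cpu_limit=0.80):
    """Traffic one machine can take while staying under the CPU limit."""
    return qps / cpu_utilisation * cpu_limit


def machines_needed(target_qps, qps_per_machine):
    """Round up, since you can't run a fraction of a machine."""
    return math.ceil(target_qps / qps_per_machine)


# Example from the slide: each machine at 20% CPU while serving 10qps.
per_machine = safe_qps_per_machine(10, 0.20)
print(f"{per_machine:.0f}qps per machine")
print(machines_needed(100, per_machine), "machines for 100qps")
```

Repeat the calculation for each type of machine, using the incoming external qps as the traffic figure each time.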
Provisioning: Visualisation
Review: The Basics
Estimating capacity:
Measure bottleneck at peak
Find how near bottleneck is to the limit
Calculate spare capacity based on peak traffic
Keep an eye on new customers/partners and organic growth to track runway
For provisioning, calculate qps/machine for each type of machine
Life is not Basic
A few wrinkles
I’ve glossed over a lot of detail so you can go away from today’s talk with
something you can immediately use.
Some questions ye may have:
Why measure at peak traffic?
What if I don’t have much traffic?
Why 80% limit on CPU and 50% on disk?
What if a machine fails?
What if things aren’t that simple?
Why measure at peak traffic?
As your utilisation increases:
● Latency increases
● Performance decreases
In addition, the skew from constant background CPU usage is smaller at peak.
Measuring at peak helps allow for these factors.
Beware the Knee.
No really, Beware the Knee
Watch out for when 10% more traffic results in significantly more than a 10%
latency increase.
This means you’re getting close to the knee, and if your traffic increases much
more you could fall over.
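A rough sketch of spotting the knee from (qps, latency) measurements. It flags the first step where latency grows more than `factor` times as fast as traffic. The factor of 2 and the sample measurements are assumptions for illustration, not figures from the talk.

```python
def find_knee(samples, factor=2.0):
    """Return the qps at which latency starts growing much faster than traffic.

    samples is a list of (qps, latency_ms) pairs. Returns None if no knee is seen.
    """
    ordered = sorted(samples)
    for (qps_a, lat_a), (qps_b, lat_b) in zip(ordered, ordered[1:]):
        if qps_a <= 0 or lat_a <= 0:
            continue  # relative growth is undefined from a zero baseline
        traffic_growth = (qps_b - qps_a) / qps_a
        latency_growth = (lat_b - lat_a) / lat_a
        if traffic_growth > 0 and latency_growth > factor * traffic_growth:
            return qps_b
    return None


# Hypothetical measurements: latency is flat until traffic nears 36qps.
measurements = [(10, 20.0), (20, 21.0), (30, 23.0), (33, 24.0), (36, 40.0)]
print(find_knee(measurements))
```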
What if I don’t have much traffic?
If you don’t have enough traffic to show up in top or iotop, then these techniques
won’t help you much.
You could loadtest, but that takes time. Or use rules of thumb.
Easier way: Use latency to estimate throughput.
If your queries take 10ms, then you can probably handle 100/s
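The latency estimate as a sketch. It assumes each worker (core, thread or process) handles one query at a time and that the measured latency is mostly time spent doing work. The `workers` parameter is an addition for illustration.

```python
def estimate_qps_from_latency(latency_ms, workers=1):
    """Rough throughput ceiling: each worker completes 1000/latency_ms queries per second."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return workers * 1000.0 / latency_ms


# Example from the slide: 10ms per query.
print(estimate_qps_from_latency(10))
```

Treat the result as an upper bound, then apply the same limits as before rather than planning to run at 100%.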
Why 80% limit on CPU and 50% on disk?
For CPU, the utilisation/latency curve means you want to avoid running at too high
a utilisation.
If you have the CPU to yourself, 90-95% is safe in a controlled environment with
good loadtesting. This is uncommon, so leave a safety margin for OS processes etc.
For spinning disks the impact of utilisation tends to be more problematic, and
background tasks tend to use a lot of disk.
What if a machine fails?
You should generally add 2 extra machines beyond what you need to serve peak
qps. This is commonly known as “n+2”.
This is to allow for one machine failure, and to let you take down a machine to
push a new binary, perform maintenance or whatever.
This also gives you some slack in your capacity. As you grow, more sophisticated
math is required.
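A minimal sketch of n+2 and of checking that peak is still covered with machines out of service. The 100qps peak and 40qps per machine figures reuse the provisioning example. The function names are illustrative.

```python
import math


def machines_with_n_plus_2(peak_qps, qps_per_machine):
    """Machines needed for peak (n), plus one for failure and one for maintenance."""
    return math.ceil(peak_qps / qps_per_machine) + 2


def survives(total_machines, machines_down, peak_qps, qps_per_machine):
    """True if the remaining machines can still serve peak traffic."""
    return (total_machines - machines_down) * qps_per_machine >= peak_qps


total = machines_with_n_plus_2(100, 40)
print(total, "machines")
print(survives(total, 2, 100, 40))  # one failed, one down for a push
print(survives(3, 1, 100, 40))      # no spares: a single failure is too many
```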
What if things aren’t that simple?
Lots of other issues can throw a spanner in the works.
Heterogeneous machines
Varying machine performance
Varying traffic mixes
Multiple datacenters
Multi-tiered services
As a general rule try to keep things simple. A perfect model is brittle and usually
takes more time than it’s worth.
Mobile isn’t simple
Mobile networks are slow and unreliable.
Need to terminate and serve TCP at the edge.
SSL is even worse: many round trips to set up a connection.
All this means that for best performance you need global load balancing across
many relatively small sites. Have to balance cost versus user benefit.
HTTP/2 makes some of this better.
Doesn’t autoscaling take care of all this for me?
Short answer
No
Long answer
Haha, Haha.
No
Doesn’t autoscaling take care of all this for me?
EC2 Autoscaling can eliminate some of the day-to-day work in provisioning
servers.
There’s operational and complexity overhead, as you have to maintain images
and systems that can be spun up.
You have to wait for instances to spin up, so you can’t rely on it completely for
sudden spikes. You need to do math to tune it to be able to handle spikes.
You still have to tune everything. Control systems are hard.
Lots of small machines vs few big machines
There’s overhead on every machine, usually around 500MB RAM and 0.2-0.5 CPUs:
system daemons, logrotate and other cronjobs, space for ssh sessions,
monitoring agents etc.
Big machines save resources, and can share overprovisioning buffers for spikes.
Small machines can be easier to manage.
Big machines with a good cluster manager and scheduler are best.
Wrapping Up
Monitoring Matters
A common thread through this talk is that monitoring is what should be providing
you the information you need to make operational decisions.
Make sure you have a good monitoring system.
Logs are not monitoring, though better than nothing.
I recommend Prometheus.io: If it didn’t exist I would have created it.
Questions?
Blog: www.robustperception.io/blog
Twitter: @RobustPerceiver
Email: brian.brazil@robustperception.io
Linkedin: https://ie.linkedin.com/in/brianbrazil

More Related Content

What's hot

Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Brian Brazil
 
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)Brian Brazil
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum JapanBrian Brazil
 
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Brian Brazil
 
Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)Brian Brazil
 
Migrating to Prometheus: what we learned running it in production
Migrating to Prometheus: what we learned running it in productionMigrating to Prometheus: what we learned running it in production
Migrating to Prometheus: what we learned running it in productionMarco Pracucci
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Brian Brazil
 
Prometheus design and philosophy
Prometheus design and philosophy   Prometheus design and philosophy
Prometheus design and philosophy Docker, Inc.
 
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
Evolution of the Prometheus TSDB  (Percona Live Europe 2017)Evolution of the Prometheus TSDB  (Percona Live Europe 2017)
Evolution of the Prometheus TSDB (Percona Live Europe 2017)Brian Brazil
 
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Brian Brazil
 
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...Brian Brazil
 
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Brian Brazil
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Brian Brazil
 
Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Brian Brazil
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Brian Brazil
 
Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)Brian Brazil
 
Cloud Monitoring with Prometheus
Cloud Monitoring with PrometheusCloud Monitoring with Prometheus
Cloud Monitoring with PrometheusQAware GmbH
 
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)Brian Brazil
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
 
No C-QL (Or how I learned to stop worrying, and love eventual consistency) (N...
No C-QL (Or how I learned to stop worrying, and love eventual consistency) (N...No C-QL (Or how I learned to stop worrying, and love eventual consistency) (N...
No C-QL (Or how I learned to stop worrying, and love eventual consistency) (N...Brian Brazil
 

What's hot (20)

Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
 
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
 
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)Prometheus:  From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
Prometheus: From Berlin to Bonanza (Keynote CloudNativeCon+Kubecon Europe 2017)
 
Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)Anatomy of a Prometheus Client Library (PromCon 2018)
Anatomy of a Prometheus Client Library (PromCon 2018)
 
Migrating to Prometheus: what we learned running it in production
Migrating to Prometheus: what we learned running it in productionMigrating to Prometheus: what we learned running it in production
Migrating to Prometheus: what we learned running it in production
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)
 
Prometheus design and philosophy
Prometheus design and philosophy   Prometheus design and philosophy
Prometheus design and philosophy
 
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
Evolution of the Prometheus TSDB  (Percona Live Europe 2017)Evolution of the Prometheus TSDB  (Percona Live Europe 2017)
Evolution of the Prometheus TSDB (Percona Live Europe 2017)
 
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
Evolving Prometheus for the Cloud Native World (FOSDEM 2018)
 
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...
Better Monitoring for Python: Inclusive Monitoring with Prometheus (Pycon Ire...
 
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
Evaluating Prometheus Knowledge in Interviews (PromCon 2018)
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
 
Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)
 
Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)Life of a Label (PromCon2016, Berlin)
Life of a Label (PromCon2016, Berlin)
 
Cloud Monitoring with Prometheus
Cloud Monitoring with PrometheusCloud Monitoring with Prometheus
Cloud Monitoring with Prometheus
 
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
OpenMetrics: What Does It Mean for You (PromCon 2019, Munich)
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
No C-QL (Or how I learned to stop worrying, and love eventual consistency) (N...
No C-QL (Or how I learned to stop worrying, and love eventual consistency) (N...No C-QL (Or how I learned to stop worrying, and love eventual consistency) (N...
No C-QL (Or how I learned to stop worrying, and love eventual consistency) (N...
 

Viewers also liked

An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)Brian Brazil
 
An Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQLAn Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQLBrian Brazil
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Brian Brazil
 
Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus OverviewBrian Brazil
 
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Brian Brazil
 
Cost savings and expert system advice with athene ES/1
Cost savings and expert system advice with athene ES/1 Cost savings and expert system advice with athene ES/1
Cost savings and expert system advice with athene ES/1 Metron
 
The Care + Feeding of a Mongodb Cluster
The Care + Feeding of a Mongodb ClusterThe Care + Feeding of a Mongodb Cluster
The Care + Feeding of a Mongodb ClusterChris Henry
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Managementrightsize
 
A bit of viral protection is worth a megabyte of cure
A bit of viral protection is worth a megabyte of cureA bit of viral protection is worth a megabyte of cure
A bit of viral protection is worth a megabyte of cureUltraUploader
 
Epistemología programa 2012
Epistemología  programa 2012Epistemología  programa 2012
Epistemología programa 2012HAV
 
Presentación club de artes marciales Un solo camino
Presentación club de artes marciales Un solo caminoPresentación club de artes marciales Un solo camino
Presentación club de artes marciales Un solo caminounsolocamino
 
Bases liceo municipal polivalente
Bases liceo municipal polivalenteBases liceo municipal polivalente
Bases liceo municipal polivalenteInstituto Imach
 

Viewers also liked (16)

An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 
An Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQLAn Exploration of the Formal Properties of PromQL
An Exploration of the Formal Properties of PromQL
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)
 
Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus Overview
 
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
 
Cost savings and expert system advice with athene ES/1
Cost savings and expert system advice with athene ES/1 Cost savings and expert system advice with athene ES/1
Cost savings and expert system advice with athene ES/1
 
The Care + Feeding of a Mongodb Cluster
The Care + Feeding of a Mongodb ClusterThe Care + Feeding of a Mongodb Cluster
The Care + Feeding of a Mongodb Cluster
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Management
 
A bit of viral protection is worth a megabyte of cure
A bit of viral protection is worth a megabyte of cureA bit of viral protection is worth a megabyte of cure
A bit of viral protection is worth a megabyte of cure
 
Universalleuchte Sistronic Plus Batz Leuchtsysteme
Universalleuchte Sistronic Plus Batz LeuchtsystemeUniversalleuchte Sistronic Plus Batz Leuchtsysteme
Universalleuchte Sistronic Plus Batz Leuchtsysteme
 
Epistemología programa 2012
Epistemología  programa 2012Epistemología  programa 2012
Epistemología programa 2012
 
Presentación club de artes marciales Un solo camino
Presentación club de artes marciales Un solo caminoPresentación club de artes marciales Un solo camino
Presentación club de artes marciales Un solo camino
 
Bases liceo municipal polivalente
Bases liceo municipal polivalenteBases liceo municipal polivalente
Bases liceo municipal polivalente
 
Malware
MalwareMalware
Malware
 
borsen_print
borsen_printborsen_print
borsen_print
 
Site builder pymedia
Site builder pymediaSite builder pymedia
Site builder pymedia
 

Similar to Provisioning and Capacity Planning (Travel Meets Big Data)

Дмитро Волошин "High[Page]load"
Дмитро Волошин "High[Page]load"Дмитро Волошин "High[Page]load"
Дмитро Волошин "High[Page]load"Fwdays
 
Low latency in java 8 by Peter Lawrey
Low latency in java 8 by Peter Lawrey Low latency in java 8 by Peter Lawrey
Low latency in java 8 by Peter Lawrey J On The Beach
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesEd Hunter
 
Low latency in java 8 v5
Low latency in java 8 v5Low latency in java 8 v5
Low latency in java 8 v5Peter Lawrey
 
Albert Witteveen - With Cloud Computing Who Needs Performance Testing
Albert Witteveen - With Cloud Computing Who Needs Performance TestingAlbert Witteveen - With Cloud Computing Who Needs Performance Testing
Albert Witteveen - With Cloud Computing Who Needs Performance TestingTEST Huddle
 
Quick guide to plan and execute a load test
Quick guide to plan and execute a load testQuick guide to plan and execute a load test
Quick guide to plan and execute a load testduke.kalra
 
Cassandra Performance Tuning Like You've Been Doing It for Ten Years
Cassandra Performance Tuning Like You've Been Doing It for Ten YearsCassandra Performance Tuning Like You've Been Doing It for Ten Years
Cassandra Performance Tuning Like You've Been Doing It for Ten YearsJon Haddad
 
Magento performancenbs
Magento performancenbsMagento performancenbs
Magento performancenbsvarien
 
Ask the Expert: Lean Leadership - Can We Talk About OEE?
Ask the Expert: Lean Leadership - Can We Talk About OEE?Ask the Expert: Lean Leadership - Can We Talk About OEE?
Ask the Expert: Lean Leadership - Can We Talk About OEE?MileyJames
 
Who’s Minding the SSO Store?
Who’s Minding the SSO Store? Who’s Minding the SSO Store?
Who’s Minding the SSO Store? CA Technologies
 
Guide to alfresco monitoring
Guide to alfresco monitoringGuide to alfresco monitoring
Guide to alfresco monitoringMiguel Rodriguez
 
Supercharge Your Applications
Supercharge Your ApplicationsSupercharge Your Applications
Supercharge Your ApplicationsSean Boiling
 
London Web Performance Meetup: Performance for mortal companies
London Web Performance Meetup: Performance for mortal companiesLondon Web Performance Meetup: Performance for mortal companies
London Web Performance Meetup: Performance for mortal companiesStrangeloop
 
Early watch report
Early watch reportEarly watch report
Early watch reportcecileekove
 
Web Performance in the Age of HTTP2 - Topconf Tallinn 2016 - Holger Bartel
Web Performance in the Age of HTTP2 - Topconf Tallinn 2016 - Holger BartelWeb Performance in the Age of HTTP2 - Topconf Tallinn 2016 - Holger Bartel
Web Performance in the Age of HTTP2 - Topconf Tallinn 2016 - Holger BartelHolger Bartel
 
Web Speed And Scalability
Web Speed And ScalabilityWeb Speed And Scalability
Web Speed And ScalabilityJason Ragsdale
 

Similar to Provisioning and Capacity Planning (Travel Meets Big Data) (20)

Дмитро Волошин "High[Page]load"
Дмитро Волошин "High[Page]load"Дмитро Волошин "High[Page]load"
Дмитро Волошин "High[Page]load"
 
Low latency in java 8 by Peter Lawrey
Low latency in java 8 by Peter Lawrey Low latency in java 8 by Peter Lawrey
Low latency in java 8 by Peter Lawrey
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
 
Mathworks CAE simulation suite – case in point from automotive and aerospace.
Mathworks CAE simulation suite – case in point from automotive and aerospace.Mathworks CAE simulation suite – case in point from automotive and aerospace.
Mathworks CAE simulation suite – case in point from automotive and aerospace.
 
Low latency in java 8 v5
Low latency in java 8 v5Low latency in java 8 v5
Low latency in java 8 v5
 
Albert Witteveen - With Cloud Computing Who Needs Performance Testing
Albert Witteveen - With Cloud Computing Who Needs Performance TestingAlbert Witteveen - With Cloud Computing Who Needs Performance Testing
Albert Witteveen - With Cloud Computing Who Needs Performance Testing
 
Quick guide to plan and execute a load test
Quick guide to plan and execute a load testQuick guide to plan and execute a load test
Quick guide to plan and execute a load test
 
Cassandra Performance Tuning Like You've Been Doing It for Ten Years
Cassandra Performance Tuning Like You've Been Doing It for Ten YearsCassandra Performance Tuning Like You've Been Doing It for Ten Years
Cassandra Performance Tuning Like You've Been Doing It for Ten Years
 
Magento performancenbs
Magento performancenbsMagento performancenbs
Magento performancenbs
 
Ask the Expert: Lean Leadership - Can We Talk About OEE?
Ask the Expert: Lean Leadership - Can We Talk About OEE?Ask the Expert: Lean Leadership - Can We Talk About OEE?
Ask the Expert: Lean Leadership - Can We Talk About OEE?
 
Who’s Minding the SSO Store?
Who’s Minding the SSO Store? Who’s Minding the SSO Store?
Who’s Minding the SSO Store?
 
Guide to alfresco monitoring
Guide to alfresco monitoringGuide to alfresco monitoring
Guide to alfresco monitoring
 
Supercharge Your Applications
Supercharge Your ApplicationsSupercharge Your Applications
Supercharge Your Applications
 
6monitor_NYMIIS
6monitor_NYMIIS6monitor_NYMIIS
6monitor_NYMIIS
 
Ch24 system administration
Ch24 system administration Ch24 system administration
Ch24 system administration
 
Ch24
Ch24Ch24
Ch24
 
London Web Performance Meetup: Performance for mortal companies
London Web Performance Meetup: Performance for mortal companiesLondon Web Performance Meetup: Performance for mortal companies
London Web Performance Meetup: Performance for mortal companies
 
Early watch report
Early watch reportEarly watch report
Early watch report
 
Web Performance in the Age of HTTP2 - Topconf Tallinn 2016 - Holger Bartel
Web Performance in the Age of HTTP2 - Topconf Tallinn 2016 - Holger BartelWeb Performance in the Age of HTTP2 - Topconf Tallinn 2016 - Holger Bartel
Web Performance in the Age of HTTP2 - Topconf Tallinn 2016 - Holger Bartel
 
Web Speed And Scalability
Web Speed And ScalabilityWeb Speed And Scalability
Web Speed And Scalability
 

Provisioning and Capacity Planning (Travel Meets Big Data)

  • 2. Who am I? Engineer passionate about running software reliably in production. Founder of Robust Perception, promoting operational efficiency through Prometheus. Google SRE for 7 years, working on high-scale reliable systems such as Billing, Adwords, Adsense, Ad Exchange, Database. Boxever TL Systems & Infrastructure, applied processes and technology to allow the company to scale and reduce operational load. Contributor to many open source projects, including Prometheus, Ansible, Python, Aurora and Zookeeper.
  • 3. What’s this talk about? At the end of the talk you will be able to: estimate how much spare capacity you have in less than 5 minutes; estimate how much runway that capacity provides; determine how many machines you need; spot common potential problems as you scale. For a simple system this should set you up for your first 1-2 years, if not more.
  • 4. Audience: This talk is looking at the basics. I assume that you: use Unix in production; have a relatively simple setup; don’t have a team doing this for you already. I’m also going to focus on webservices-type systems that mobile devices would talk to, rather than offline processing or batch.
  • 6. Estimate your capacity in 3 easy steps! 1. Measure bottleneck resource at peak traffic 2. Divide to get fraction of limit 3. Divide into peak traffic
  • 7. Estimate your capacity in 3 not so easy steps! 1. What’s your bottleneck? How do you measure it? 2. What’s your bottleneck’s limit? 3. What’s your peak traffic?
  • 8. Step 1: What’s the bottleneck? The most common bottlenecks: 1. CPU 2. Disk I/O Less common: network, disk space, external resources, quotas, hardcoded limits, contention/locking, memory, file descriptors, port numbers, humans
  • 9. Step 1: Where’s the bottleneck? Look at CPU % and Disk I/O Utilisation on each type of machine. If you’ve monitoring, use that. Failing that:
      sudo apt-get install sysstat
      iostat -x 5
  • 10. Step 1: Iostat
      avg-cpu:   %user   %nice %system %iowait  %steal   %idle
                  4.24    0.00    1.18    0.98    0.00   93.60

      Device: rrqm/s wrqm/s  r/s   w/s rkB/s  wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
      sda       0.00   1.40 0.00  3.80  0.00  45.20    23.79     0.00  1.05    0.00    1.05  0.84  0.32
      sdb       0.00   1.40 0.00 21.00  0.00 267.20    25.45     0.09  4.11    0.00    4.11  4.11  8.64
      sdc       0.00   1.40 0.00 20.00  0.00 267.20    26.72     0.06  3.24    0.00    3.24  3.24  6.48
      md0       0.00   0.00 0.00  2.00  0.00   8.00     8.00     0.00  0.00    0.00    0.00  0.00  0.00
    The numbers you care about are %idle and %util. %idle is the amount of CPU not in use. %util is the amount of disk I/O in use; take the biggest one.
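If you would rather not read the numbers off by eye, the two values can be pulled out with a few lines of Python. This is only a sketch: `parse_iostat` is a hypothetical helper, the sample text is the report from the slide, and the column layout is an assumption, since iostat output varies across sysstat versions and unices.

```python
# Sample report copied from the slide. Real iostat output differs between
# sysstat versions, so treat the column names here as an assumption.
SAMPLE = """\
avg-cpu: %user %nice %system %iowait %steal %idle
4.24 0.00 1.18 0.98 0.00 93.60

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 1.40 0.00 3.80 0.00 45.20 23.79 0.00 1.05 0.00 1.05 0.84 0.32
sdb 0.00 1.40 0.00 21.00 0.00 267.20 25.45 0.09 4.11 0.00 4.11 4.11 8.64
sdc 0.00 1.40 0.00 20.00 0.00 267.20 26.72 0.06 3.24 0.00 3.24 3.24 6.48
md0 0.00 0.00 0.00 2.00 0.00 8.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
"""


def parse_iostat(text):
    """Return (cpu_idle_percent, max_disk_util_percent) from one `iostat -x` report."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    cpu_idle = None
    max_util = 0.0
    util_col = None
    for i, row in enumerate(rows):
        if row[0] == "avg-cpu:":
            # The values are on the next line, without the "avg-cpu:" label,
            # so the column index shifts left by one.
            cpu_idle = float(rows[i + 1][row.index("%idle") - 1])
            util_col = None
        elif row[0].startswith("Device"):
            util_col = row.index("%util")
        elif util_col is not None:
            # A device row: keep the busiest disk.
            max_util = max(max_util, float(row[util_col]))
    return cpu_idle, max_util


idle, util = parse_iostat(SAMPLE)
print(f"CPU in use: {100 - idle:.2f}%, busiest disk: {util:.2f}%")
```

Feed it a report taken at peak, after iostat has printed a few iterations, so the numbers are typical ones.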
  • 11. Step 2: What’s the limit? We now know the CPU and disk I/O usage on each machine at peak. Which is the bottleneck though? Need to know the limit. Rules of thumb: 80% limit for CPU, 50% limit for Disk I/O.
  • 12. Step 2: Division. Find how full each CPU and disk is. Say we had a disk 10% utilised, and a CPU 20% utilised (80% idle). 0.1/0.5 = 0.2 => Disk I/O is at 20% of limit. 0.2/0.8 = 0.25 => CPU is at 25% of limit. CPU is our bottleneck, with 25% of capacity used.
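The division on this slide can be sketched in Python as follows. The function names and the `LIMITS` table are illustrative, not from the talk; the limits are the rules of thumb from the previous slide and should be adjusted for your own systems.

```python
# Rule-of-thumb limits from the talk: 80% for CPU, 50% for disk I/O.
LIMITS = {"cpu": 0.80, "disk": 0.50}


def fraction_of_limit(utilisation, limits=LIMITS):
    """Map each resource to how much of its safe limit is already used."""
    return {name: used / limits[name] for name, used in utilisation.items()}


def find_bottleneck(utilisation, limits=LIMITS):
    """Return (resource, fraction_of_limit) for the resource closest to its limit."""
    fractions = fraction_of_limit(utilisation, limits)
    name = max(fractions, key=fractions.get)
    return name, fractions[name]


# The example from the slide: CPU 20% utilised, disk 10% utilised.
name, used = find_bottleneck({"cpu": 0.20, "disk": 0.10})
print(f"{name} is the bottleneck at {used:.0%} of its limit")
```

Note that the bottleneck is the resource with the highest fraction of its limit used, which is not necessarily the one with the highest raw utilisation.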
  • 13. Step 2: Utilisation Visualisation
  • 14. Step 3: Peak traffic. Now that we know how full our bottleneck is, we need to know how much capacity we have. Figure out how much traffic you were handling around the time you measured CPU and disk utilisation. You might do this via monitoring, by parsing logs or, if you’re really stuck, with tcpdump.
  • 15. Step 3: The 2nd division. Let’s say our queries per second (qps) was 10 around peak. Our CPU was our bottleneck, and about 25% of our limit. 10/0.25 = 40qps. So we can currently handle a maximum traffic of around 40qps.
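The second division is equally short in code. A minimal sketch, assuming you already have the peak qps and the bottleneck's fraction of limit from the previous step; `estimate_capacity` is a name made up for illustration.

```python
def estimate_capacity(peak_qps, fraction_of_limit_used):
    """Estimate the maximum qps the system can handle at its safe limit."""
    if fraction_of_limit_used <= 0:
        raise ValueError("fraction_of_limit_used must be positive")
    return peak_qps / fraction_of_limit_used


# The example from the slide: 10qps at peak, bottleneck at 25% of its limit.
capacity = estimate_capacity(10, 0.25)
spare = capacity - 10
print(f"capacity ~{capacity:.0f}qps, spare ~{spare:.0f}qps")
```

This assumes load on the bottleneck grows roughly linearly with traffic, which is the same simplification the talk makes.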
  • 16. Step 3: Capacity Visualisation
  • 17. Now you can estimate your capacity in 3 easy steps! 1. Measure bottleneck resource at peak traffic: use monitoring or iostat to see how close you are to the limit, say 20% full. 2. Divide to get fraction of limit: with a limit of 80% for CPU, you’re 20/80 = 25% full. 3. Divide into peak traffic: traffic was 10qps, so 10/0.25 = 40qps capacity.
  • 19. How much runway do you have? You now have a rough idea of how much capacity you have to spare. In the example here, we’re using 10qps out of 40qps capacity. How long will that 30qps last you? The two main factors are new customers and organic growth.
  • 20. New Customers. New customers/partners are your main source of traffic. Look at your traffic graphs around the time a new customer started using your system. If the customer had, say, 1M users and you saw a 10qps increase in peak traffic, you can now predict how much traffic future customers will need. Based on sales predictions, you can tell how much capacity you’ll need for new customers.
  • 21. Organic growth Over time your existing customers/partners will use the system more and more, new employees are hired, they get new customers etc. Look at your monitoring’s traffic graphs over a few months to see what the trend is like. Do your best to ignore the impact of launches. Calculate your % growth month on month. Starting out, it’s likely that organic growth will not be your main consideration.
  • 22. Calculating runway. Once again in the example here, we’re using 10qps out of 40qps capacity. Each 1M user customer generates 10qps of additional traffic. You also expect a negligible amount of organic growth. This means you can handle 3M more users’ worth of new customers. If you’re signing up one 1M user customer per month, that gives you 3 months.
  • 24. Provisioning vs Capacity Planning. Capacity Planning: In 6 months I will have 7 new customers, and need to be able to handle 100qps in total. Provisioning: To handle 100qps I need X frontends and Y databases.
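The runway arithmetic can be sketched as a month-by-month projection. `runway_months` and its parameters are illustrative names; the optional `organic_growth` term is an addition for cases where organic growth is not negligible, and applying it before adding each month's new customers is a modelling assumption.

```python
def runway_months(capacity_qps, current_qps, qps_per_new_customer,
                  new_customers_per_month, organic_growth=0.0):
    """Count whole months until projected peak traffic exceeds capacity."""
    added_per_month = qps_per_new_customer * new_customers_per_month
    if organic_growth <= 0 and added_per_month <= 0:
        raise ValueError("traffic never grows, so runway is unbounded")
    months = 0
    qps = current_qps
    while True:
        # Existing traffic grows organically, then this month's new customers arrive.
        next_qps = qps * (1 + organic_growth) + added_per_month
        if next_qps > capacity_qps:
            return months
        qps = next_qps
        months += 1


# The example from the slide: 10qps used of 40qps, one 10qps customer a month.
print(runway_months(40, 10, 10, 1))
```

With the slide's numbers this gives 3 months; adding 10% monthly organic growth to the same inputs shortens it to 2.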
  • 25. Provisioning: What can a machine handle? Continuing our example, let’s say we had 4 machines and each reported being at CPU 20% (25% of the 80% limit) while dealing with 10qps each. The key metric is qps per machine. At 20% CPU each machine serves 10qps, so a fully used machine would serve 10qps/0.2 = 50qps. We can only safely use 80% of the machine, so 50*0.8 = 40qps. So we can handle 40qps per machine.
  • 26. Provisioning: How many machines do I need? If we want to handle 100qps, we need 100/40 = 2.5 machines. So 3 machines. For each type of machine, calculate the incoming external qps it can handle and how many you need. Don’t fret about $10/month worth of cost, it’s not worth your time.
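The two provisioning slides boil down to the following sketch. The function names are made up for illustration, and the 80% CPU limit is the rule of thumb from earlier in the talk.

```python
import math


def qps_per_machine(observed_qps, cpu_utilisation, cpu_limit=0.80):
    """qps one machine can serve when run up to its safe CPU limit."""
    return observed_qps / cpu_utilisation * cpu_limit


def machines_needed(target_qps, per_machine_qps):
    """Round up, since you can't provision a fraction of a machine."""
    return math.ceil(target_qps / per_machine_qps)


# The example from the slides: 10qps at 20% CPU, target of 100qps.
per_machine = qps_per_machine(10, 0.20)
print(f"{per_machine:.0f}qps per machine")
print(machines_needed(100, per_machine), "machines")
```

Run this once per type of machine (frontends, databases and so on), using the external qps each type sees.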
  • 28. Review: The Basics. Estimating capacity: measure bottleneck at peak; find how near bottleneck is to the limit; calculate spare capacity based on peak traffic. Keep an eye on new customers/partners and organic growth to track runway. For provisioning, calculate qps/machine for each type of machine.
  • 29. Life is not Basic
  • 30. A few wrinkles I’ve glossed over a lot of detail so you can go away from today’s talk with something you can immediately use. Some questions ye may have: Why measure at peak traffic? What if I don’t have much traffic? Why 80% limit on CPU and 50% on disk? What if a machine fails? What if things aren’t that simple?
  • 31. Why measure at peak traffic? As your utilisation increases: ● Latency increases ● Performance decreases. In addition, the skew due to a constant background of CPU usage is reduced. Measuring at peak helps allow for these factors. Beware the Knee.
  • 32. No really, Beware the Knee Watch out for when 10% more traffic results in significantly more than a 10% latency increase. This means you’re getting close to the knee, and if your traffic increases much more you could fall over.
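One rough way to watch for the knee is to compare relative latency growth with relative traffic growth between measurements. The sketch below is an assumption-laden illustration: `find_knee` is a hypothetical helper, the samples are made up, and the `factor` of 2 is just one reading of "significantly more".

```python
def find_knee(samples, factor=2.0):
    """Return the first qps at which latency grows `factor` times faster
    than traffic, or None if that never happens.

    `samples` is a list of (qps, latency_ms) pairs sorted by qps.
    """
    for (qps0, lat0), (qps1, lat1) in zip(samples, samples[1:]):
        traffic_increase = (qps1 - qps0) / qps0
        latency_increase = (lat1 - lat0) / lat0
        if latency_increase > factor * traffic_increase:
            return qps1
    return None


# Hypothetical measurements: latency is flat, then climbs sharply.
samples = [(10, 20.0), (20, 21.0), (30, 23.0), (35, 40.0)]
print(find_knee(samples))
```

In practice latency measurements are noisy, so this is better applied to averages over a period than to individual readings.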
  • 33. What if I don’t have much traffic? If you don’t have enough traffic to show up in top or iotop, then these techniques won’t help you much. You could loadtest, but that takes time. Or use rules of thumb. Easier way: Use latency to estimate throughput. If your queries take 10ms, then you can probably handle 100/s
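The latency rule of thumb is a one-line calculation. In this sketch the `workers` parameter is an assumption added for illustration (the slide's figure corresponds to handling one query at a time), and it presumes queries do not contend with each other.

```python
def throughput_from_latency(latency_ms, workers=1):
    """Rough queries per second if each worker handles one query at a time."""
    return 1000.0 / latency_ms * workers


# The example from the slide: 10ms queries.
print(throughput_from_latency(10))
```

Treat the result as an optimistic estimate, since latency rises as utilisation increases.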
  • 34. Why 80% limit on CPU and 50% on disk? For CPU, due to the utilisation/latency curve, you want to avoid having too high a utilisation. If you have the CPU to yourself, 90-95% is safe in a controlled environment with good loadtesting. This is uncommon, so leave a safety margin for OS processes etc. For spinning disks the impact of utilisation tends to be more problematic, and background tasks tend to use a lot of disk.
  • 35. What if a machine fails? You generally should add 2 extra machines beyond what you need to serve peak qps. This is commonly known as “n+2”. This is to allow for one machine failure, and to let you take down a machine to push a new binary, perform maintenance or whatever. This also gives you some slack in your capacity. As you grow, more sophisticated math is required.
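The n+2 rule can be checked with a couple of small functions. The names are illustrative, and the 100qps peak and 40qps per machine are carried over from the provisioning example.

```python
import math


def machines_with_spares(peak_qps, qps_per_machine, spares=2):
    """Machines needed for peak traffic, plus spares (n+2 by default)."""
    return math.ceil(peak_qps / qps_per_machine) + spares


def can_serve_peak(machines, machines_down, peak_qps, qps_per_machine):
    """True if the machines still up can carry peak traffic."""
    return (machines - machines_down) * qps_per_machine >= peak_qps


total = machines_with_spares(100, 40)
print(total, "machines")
# One machine failed and one down for a push: still fine.
print(can_serve_peak(total, 2, 100, 40))
# A third machine out would leave you short.
print(can_serve_peak(total, 3, 100, 40))
```

This only covers the simple small-fleet case; as the slide says, larger fleets need more sophisticated math.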
  • 36. What if things aren’t that simple? Lots of other issues can throw a spanner in the works: heterogeneous machines, varying machine performance, varying traffic mixes, multiple datacenters, multi-tiered services. As a general rule try to keep things simple. A perfect model is brittle and usually takes more time than it’s worth.
  • 37. Mobile isn’t simple. Mobile networks are slow and unreliable. Need to terminate and serve TCP at the edge. SSL is even worse, with many round trips to set up a connection. All this means that for best performance you need global load balancing across many relatively small sites. Have to balance cost versus user benefit. HTTP/2 makes some of this better.
  • 40. Doesn’t autoscaling take care of all this for me? Short answer: No. Long answer: Haha, Haha. No.
  • 41. Doesn’t autoscaling take care of all this for me? EC2 Autoscaling can eliminate some of the day-to-day work in provisioning servers. There’s operational and complexity overhead, as you have to maintain images and systems that can be spun up. You have to wait for instances to spin up, so you can’t rely on it completely for sudden spikes. You need to do math to tune it to be able to handle spikes. You still have to tune everything. Control systems are hard.
  • 42. Lots of small machines vs few big machines. There’s overhead on every machine, usually around 500MB RAM and 0.2-0.5 CPUs: system daemons, logrotate and other cronjobs, space for ssh sessions, monitoring agents etc. Big machines save resources, and can share overprovisioning buffers for spikes. Small machines can be easier to manage. Big machines with a good cluster manager and scheduler are best.
  • 44. Monitoring Matters A common thread through this talk is that monitoring is what should be providing you the information you need to make operational decisions. Make sure you have a good monitoring system. Logs are not monitoring, though better than nothing. I recommend Prometheus.io: If it didn’t exist I would have created it.
  • 45. Questions? Blog: www.robustperception.io/blog Twitter: @RobustPerceiver Email: brian.brazil@robustperception.io Linkedin: https://ie.linkedin.com/in/brianbrazil

Editor's Notes

  1. Exact command will vary across unices.
  2. Give it a few iterations so you can see what the typical numbers are.
  3. Latency increases due to queuing theory. Performance decreases due to locking and contention.
  4. You can’t choose when a machine will fail, so you can’t merge the one machine for failure and the one for operations.
  5. Can your software & config update process handle machines appearing and disappearing? Cascading overload is a risk with spikes when you depend on gradually adding new instances, rather than always having capacity ready. You have to keep an eye on non-autoscaled systems too.