Beyond 1000 BOSH Deployments
2
~/
→ Sven
● At anynines since 2015
● Ordained Minister
● Working with Cloud Foundry, BOSH and K8s
● Twitter @ShalahAllier
3
Also Known For
Mentor in the EVE: Online Alliance Test Alliance Please Ignore
Also where my slack avatar comes from
All the office plants
4
The Road to 1200 Deployments
5
The a9s Data Service Framework
6
BOSH Setup
Overbosh: Deploys runtime and Underbosh
Underbosh: Deploys service brokers and services (this is where the 1200
deployments live); uses the CredHub colocated on Overbosh
Utilsbosh: Utilities and Prometheus monitoring
7
Lessons Learned
● Deploy less with create-env
● Only create-env utilsbosh
● Deploy other directors with Utilsbosh
● Using Overbosh as the CredHub provider for Underbosh can be
suboptimal
○ Recreating the Overbosh means nobody can create services in the meantime
○ A dedicated CredHub/UAA deployment is better
● Using an external RDS does not solve all problems
○ More about that later
8
Running it
9
IO Credits are Fun
● AWS limits IOPS on SSD disks: gp2 gets a rate of 3 IOPS/GB
● No such limit on magnetic volumes (st1, sc1, standard)
● AWS-Stage BOSH DB runs on gp2
● AWS-Prod BOSH DB runs on standard
● You can see a disk's IOPS budget in CloudWatch
● Unless it's an RDS instance
● You have to create an alert for every single volume
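The per-volume alerting from the last bullet could be sketched like this. The snippet only builds the alarm definitions; it does not call AWS. It assumes the CloudWatch metric `BurstBalance` in the `AWS/EBS` namespace (percent of I/O credits left), and the volume IDs, alarm names and the 20% threshold are made up for illustration.

```python
from typing import Dict, List


def burst_balance_alarm(volume_id: str, threshold_percent: float = 20.0) -> Dict:
    """Build the keyword arguments for one CloudWatch alarm on one EBS volume."""
    return {
        "AlarmName": f"ebs-burst-balance-low-{volume_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EBS",
        "MetricName": "BurstBalance",  # percent of I/O credits remaining
        "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
        "Statistic": "Minimum",
        "Period": 300,               # evaluate in 5 minute windows
        "EvaluationPeriods": 3,      # must stay low for 15 minutes
        "Threshold": threshold_percent,
        "ComparisonOperator": "LessThanThreshold",
    }


def alarms_for(volume_ids: List[str]) -> List[Dict]:
    """One alarm definition per volume, since alarms are not shared across volumes."""
    return [burst_balance_alarm(volume_id) for volume_id in volume_ids]


# Hypothetical volume IDs, not taken from the talk.
alarms = alarms_for(["vol-0aaa", "vol-0bbb"])
```

With boto3 installed and credentials configured, each dictionary could be passed on as `cloudwatch.put_metric_alarm(**alarm)`; that call is deliberately left out here so the sketch runs without an AWS account.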
10
Effects
● Database in AWS-P is consistently slower, but response times do not
vary
● Database in AWS-S became unresponsive at times
○ BOSH sometimes sends a few thousand requests that do large joins
● EU-P BOSH has 50GB of standard disk ($3/mo)
● EU-S BOSH has 1TB of gp2 disk ($119/mo)
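A rough worked example of the gp2 numbers. Only the 3 IOPS/GB rate comes from the slides; the 100 IOPS floor, the 16,000 IOPS cap, the 3,000 IOPS burst rate and the 5.4 million credit bucket are assumptions based on the published gp2 behaviour and should be checked against current AWS documentation.

```python
GP2_IOPS_PER_GIB = 3           # from the slides
GP2_MIN_IOPS = 100             # assumption: gp2 baseline floor
GP2_MAX_IOPS = 16_000          # assumption: gp2 baseline cap
GP2_BURST_IOPS = 3_000         # assumption: gp2 burst rate
GP2_CREDIT_BUCKET = 5_400_000  # assumption: I/O credits in a full bucket


def gp2_baseline_iops(size_gib):
    """Baseline IOPS a gp2 volume of this size sustains indefinitely."""
    return max(GP2_MIN_IOPS, min(GP2_IOPS_PER_GIB * size_gib, GP2_MAX_IOPS))


def seconds_of_full_burst(size_gib):
    """Seconds a full credit bucket lasts at the burst rate, or None if
    the baseline already matches or exceeds the burst rate."""
    baseline = gp2_baseline_iops(size_gib)
    if baseline >= GP2_BURST_IOPS:
        return None
    return GP2_CREDIT_BUCKET / (GP2_BURST_IOPS - baseline)
```

Under these assumptions a 50 GB gp2 volume would have a 150 IOPS baseline and could burst for roughly half an hour before the credits run out, while a 1 TB gp2 volume has a baseline at the burst rate and never depends on credits.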
11
Things That Drain Your IOPS
● The daily snapshot task, even if snapshots are disabled
○ Made less severe
● `bosh vms` / `bosh deployments`
○ More later
● If the IOPS on your director disk get depleted repeatedly, switch to
magnetic storage like sc1 or st1
○ Slower than gp2 at max speed
○ Costs half
○ Consistent and fast enough for BOSH
12
Some issues we had
13
September 2018
● 670 Deployments
● BOSH director is very slow
● Some queries take 2-3 minutes to complete
● Scaling BOSH and the DB brings only minor improvements
● An m4.2xlarge RDS is a bit faster, but does not solve it
● More disk IOPS do not help
14
Solution
● Updating the director
● The reason: every `bosh vms` made the director also select the
deployment configs for each deployment separately
○ Even though they were not part of the output
● SAP stumbled over the issue first and fixed it
15
November 2018
● BOSH unresponsive or very slow
● No uploads/deploys possible
● Persistent disk 50% free
● “df -i” showed all inodes exhausted
● BOSH stores task logs on disk
○ And deletes regularly
○ If you have 900 deployments and the Prometheus bosh exporter does a `bosh vms` every 5
minutes, you create tasks faster than BOSH cleans them up
○ 1.8m task log folders on disk
○ Every one contained 0-3 log files
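A back-of-the-envelope calculation for the numbers above. It assumes the exporter creates one task per deployment on every 5 minute run and that no cleanup happens at all, which is a simplification; the function names are made up for illustration.

```python
def tasks_per_day(deployments, interval_minutes):
    """Tasks created per day if every run creates one task per deployment."""
    runs_per_day = 24 * 60 // interval_minutes
    return deployments * runs_per_day


def days_to_accumulate(total_tasks, deployments, interval_minutes):
    """Days until total_tasks task log folders exist, ignoring any cleanup."""
    return total_tasks / tasks_per_day(deployments, interval_minutes)
```

With 900 deployments and a 5 minute interval this gives 259,200 tasks per day, so under these assumptions 1.8m task log folders would pile up in about a week.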
16
Solution
● Removing some older log files (1.79m)
● Scaling the disk
● Notifying BOSH core
● Set up alert for Inode usage on all persistent disks
● Switch from bosh exporter to graphite hm plugin
● BOSH core made the director more aggressive at purging old task
logs
○ Went from 1.6m task logs on disk to just 18,000
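The inode alert from the list above could be based on a check like the following minimal sketch. It uses `os.statvfs` from the Python standard library (Unix only). The 80% threshold is an arbitrary assumption, and the path to check would be the mount point of the persistent disk, which is not named in the slides.

```python
import os


def inode_usage_percent(path):
    """Percentage of inodes in use on the filesystem containing path.

    Returns None for filesystems that do not report a fixed inode count.
    """
    stats = os.statvfs(path)
    if stats.f_files == 0:
        return None
    used = stats.f_files - stats.f_ffree
    return 100.0 * used / stats.f_files


def inode_alert(path, threshold_percent=80.0):
    """True if inode usage on path is at or above the threshold."""
    usage = inode_usage_percent(path)
    return usage is not None and usage >= threshold_percent
```

A disk can be half empty in terms of bytes and still trip this check, which is exactly the November 2018 situation.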
17
December 2018
● BOSH very slow
● Sometimes locks up for minutes
● The database works on some queries longer than BOSH is willing to wait
● Whenever a service is deployed or updated
18
Investigation
● Turns out when you use `bosh tasks -r` it queries the last 30 tasks
● We had 3.5m tasks in the DB
● Query: `SELECT * FROM "tasks" WHERE ("deployment_name" =
'd27eda6') ORDER BY "timestamp" DESC LIMIT 30`
○ No index on deployment_name
○ So if only 29 tasks are there, it crawls through all 3.5m rows to find task 30.
○ Most deployments have fewer than 30 tasks in the DB
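The effect of a missing index can be reproduced in miniature. The sketch below uses an in-memory SQLite database from the Python standard library, not the director's real database, and a simplified `tasks` table whose columns beyond `deployment_name` and `timestamp` are invented. The index name `idx_tasks_deployment_ts` is hypothetical and is not the fix BOSH core shipped.

```python
import sqlite3

QUERY = (
    'SELECT * FROM "tasks" WHERE ("deployment_name" = ?) '
    'ORDER BY "timestamp" DESC LIMIT 30'
)


def query_plan(conn, deployment):
    """Return SQLite's query plan for QUERY as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (deployment,)).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail


conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "tasks" ('
    '"id" INTEGER PRIMARY KEY, "deployment_name" TEXT, '
    '"timestamp" INTEGER, "state" TEXT)'
)
# 20,000 tasks spread over 1,000 deployments: 20 tasks each, fewer than 30.
conn.executemany(
    'INSERT INTO "tasks" ("deployment_name", "timestamp", "state") VALUES (?, ?, ?)',
    [(f"d{i % 1000}", i, "done") for i in range(20_000)],
)

plan_before = query_plan(conn, "d27")  # full scan of the table
conn.execute(
    'CREATE INDEX "idx_tasks_deployment_ts" '
    'ON "tasks" ("deployment_name", "timestamp")'
)
plan_after = query_plan(conn, "d27")   # lookup through the index
rows = conn.execute(QUERY, ("d27",)).fetchall()
```

Printing `plan_before` and `plan_after` shows the planner switching from scanning the whole table to searching the index, while the query result stays the same.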
19
Solution
● Change to -r=1
● Make a deploy task for each deployment to make sure there is one
task
● Issue with BOSH core (No. 2105)
● BOSH core fix:
○ BOSH deletes old tasks faster so you have fewer (10 instead of 2 in each run)
○ Put an index on task types
○ 3.5m tasks → 1100 tasks in the DB
20
Monitoring
21
Things You Should Monitor
● Network IP exhaustion
○ IaaS dependent, but running out of IPs during deploys is suboptimal
○ Especially when the customer notices first
● Disk IOPS (depending on IaaS)
● Quota limitations
○ Record holder is Azure, where a limit increase took 9 days
● CPU credits on important instances
● Disk inode usage, not just how full it is in terms of data
● Certificate expiration
● Check if metrics are missing
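As one example from this list, a certificate expiration check could look like the sketch below. It only does the date arithmetic on a certificate's `notAfter` string (the format returned by Python's `ssl` module) and does not fetch any certificate. The 30 day warning window is an arbitrary assumption.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days until a certificate expires.

    not_after is a string like "Jan  5 09:34:43 2018 GMT".
    now is a Unix timestamp and defaults to the current time.
    """
    expiry = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expiry - now) / 86400


def expires_soon(not_after, warn_days=30, now=None):
    """True if the certificate expires within warn_days days."""
    return days_until_expiry(not_after, now) < warn_days
```

The `now` parameter exists so the check can be tested against fixed dates instead of the wall clock.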
22
What 1200 deployments taught us
● BOSH team is usually rather fast at fixing issues that block the
director
● BOSH itself is pretty stable
● Change from the Prometheus bosh exporter to the graphite hm plugin
● For most small to medium environments a t2.large (2 vCPUs, 8GB RAM,
burstable CPU) or equivalent is plenty
● For large environments an m5.xlarge or m5.2xlarge is enough
○ Disk IO/network speed will most likely be the bottleneck
23
Advice
● Don’t overdo it on the worker count
○ Our biggest director still has only 9 workers for tasks
○ The others usually have 3-4 workers
● Otherwise you run the risk of starving the director of CPU when all
workers are busy simultaneously
24
The End
25
Keep in Touch
anynines.com
@anynines
26
@ShalahAllier
Questions?
27

ThousandEyes
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 


Beyond 1000 BOSH Deployments

  • 2. Beyond 1000 BOSH Deployments
  • 3. ~/ → Sven
      ● At anynines since 2015
      ● Ordained Minister
      ● Working with Cloud Foundry, BOSH and K8s
      ● Twitter @ShalahAllier
  • 4. Also Known For
      Mentor in the EVE: Online alliance Test Alliance Please Ignore
      Also where my Slack avatar comes from
      All the office plants
  • 5. The Road to 1200 Deployments
  • 6. The a9s Data Service Framework
  • 7. BOSH Setup
      Overbosh: Deploys the runtime and Underbosh
      Underbosh: Deploys service brokers and services (this is where the 1200 deployments run); uses the CredHub colocated on Overbosh
      Utilsbosh: Utilities and Prometheus monitoring
  • 8. Lessons Learned
      ● Deploy less with create-env
      ● Only deploy Utilsbosh with create-env
      ● Deploy the other directors with Utilsbosh
      ● Using Overbosh as the CredHub provider for Underbosh can be suboptimal
          ○ Recreating the Overbosh means people cannot create services in the meantime
          ○ A dedicated CredHub/UAA deployment is better
      ● Using an external RDS does not solve all problems
          ○ More about that later
  • 10. IO Credits are Fun
      ● AWS limits IOPS on SSD disks: gp2 gets a baseline of 3 IOPS/GB
      ● No such limit on magnetic volumes (st1, sc1, standard)
      ● AWS-Stage BOSH DB runs on gp2
      ● AWS-Prod BOSH DB runs on standard
      ● You can see the IOPS budget of a disk in CloudWatch
      ● Unless it's an RDS instance
      ● You have to create an alert for every single volume
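
To make the 3 IOPS/GB rule concrete, here is a small sketch of the gp2 burst arithmetic. The constants are the figures AWS publishes for gp2 (baseline floor of 100 IOPS, burst ceiling of 3,000 IOPS, 5.4 million I/O credits in a full bucket); they are not from the slides, and the function names are made up for illustration.

```python
from typing import Optional

# gp2 figures as published by AWS (assumptions, not taken from the slides).
BASELINE_PER_GIB = 3       # IOPS earned per GiB of volume size
MIN_BASELINE = 100         # floor for very small volumes
MAX_BASELINE = 16_000      # ceiling for very large volumes
BURST_IOPS = 3_000         # maximum IOPS while spending credits
FULL_BUCKET = 5_400_000    # I/O credits in a full bucket


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS of a gp2 volume of the given size."""
    return max(MIN_BASELINE, min(MAX_BASELINE, BASELINE_PER_GIB * size_gib))


def seconds_until_depleted(size_gib: int, demand_iops: int) -> Optional[float]:
    """Seconds a full credit bucket lasts under sustained demand.

    Returns None when the demand never exceeds the baseline, because
    the bucket does not drain in that case.
    """
    baseline = gp2_baseline_iops(size_gib)
    effective = min(demand_iops, BURST_IOPS)
    if effective <= baseline:
        return None
    return FULL_BUCKET / (effective - baseline)


if __name__ == "__main__":
    for size in (50, 1000):
        print(size, gp2_baseline_iops(size), seconds_until_depleted(size, 3000))
```

Under these assumptions a 50 GB gp2 volume has a 150 IOPS baseline and empties a full bucket in roughly 32 minutes of sustained 3,000 IOPS load, while a 1 TB volume already has a 3,000 IOPS baseline and never drains.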
  • 11. Effects
      ● Database in AWS-P is consistently slower, but shows no variation in response times
      ● Database in AWS-S went unresponsive at some points
          ○ BOSH sometimes sends a few thousand requests that do large joins
      ● EU-P BOSH has 50GB of standard disk ($3/mo)
      ● EU-S BOSH has 1TB of gp2 disk ($119/mo)
  • 12. Things That Drain Your IOPS
      ● The daily snapshot task, even if snapshots are disabled
          ○ Made less severe
      ● bosh vms/deployments
          ○ More later
      ● If the IOPS on the director disk get depleted repeatedly, switch to magnetic storage such as sc1 or st1
          ○ Slower than gp2 at max speed
          ○ Costs half
          ○ Consistent and fast enough for BOSH
  • 13. Some issues we had
  • 14. September 2018
      ● 670 Deployments
      ● BOSH director is very slow
      ● Some queries take 2-3 minutes to complete
      ● Scaling BOSH and the DB brings only minor improvements
      ● An m4.2xlarge RDS instance is a bit faster, but does not solve it
      ● More disk IOPS do not help
  • 15. Solution
      ● Updating the director
      ● The cause: every `bosh vms` made the director also select the deployment configs for each deployment separately
          ○ Even though they were not part of the output
      ● SAP stumbled over the issue first and fixed it
  • 16. November 2018
      ● BOSH unresponsive or very slow
      ● No uploads/deploys possible
      ● Persistent disk 50% free
      ● `df -i` showed all inodes exhausted
      ● BOSH stores task logs on disk
          ○ And deletes them regularly
          ○ If you have 900 deployments and the Prometheus BOSH exporter runs a `bosh vms` every 5 minutes, you create tasks faster than BOSH cleans them up
          ○ 1.8m task log folders on disk
          ○ Each one contained 0-3 log files
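
A rough back-of-the-envelope calculation shows how quickly that adds up. This sketch assumes each exporter run creates one task (and one task log folder) per deployment and that nothing is cleaned up in the meantime; both are simplifying assumptions, not figures from the talk.

```python
def tasks_per_day(deployments: int, interval_minutes: int) -> int:
    """Tasks created per day, assuming one task per deployment per run."""
    runs_per_day = 24 * 60 // interval_minutes
    return deployments * runs_per_day


def days_to_accumulate(target: int, deployments: int, interval_minutes: int) -> float:
    """Days until `target` task folders exist if none are ever purged."""
    return target / tasks_per_day(deployments, interval_minutes)


if __name__ == "__main__":
    print(tasks_per_day(900, 5))                  # tasks per day
    print(days_to_accumulate(1_800_000, 900, 5))  # days to reach 1.8m folders
```

With 900 deployments and a 5-minute interval this comes to 259,200 tasks per day, so under these assumptions 1.8m folders would pile up in about a week without any cleanup.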
  • 17. Solution
      ● Removing some older log files (1.79m)
      ● Scaling the disk
      ● Notifying BOSH core
      ● Set up an alert for inode usage on all persistent disks
      ● Switch from the BOSH exporter to the graphite hm plugin
      ● BOSH core made the director more aggressive at purging old task logs
          ○ Went from 1.6m task logs on disk to just 18,000
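
An inode alert needs the same numbers that `df -i` prints. The following is a minimal sketch using Python's `os.statvfs` (Unix only); the function names and the 80% threshold are illustrative choices, and how the result is wired into an alerting system is left open.

```python
import os


def inode_usage_percent(path: str) -> float:
    """Percentage of inodes in use on the filesystem containing `path`."""
    st = os.statvfs(path)
    if st.f_files == 0:
        # Some filesystems do not report inode totals at all.
        return 0.0
    used = st.f_files - st.f_ffree
    return 100.0 * used / st.f_files


def inode_alert(path: str, threshold_percent: float = 80.0) -> bool:
    """True when inode usage has reached the (hypothetical) threshold."""
    return inode_usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    # Point this at the persistent disk mount point on the director VM.
    print(f"{inode_usage_percent('/'):.1f}% of inodes used")
```

The point of checking inodes separately is the situation described above: the disk was 50% free in terms of data while no new files could be created.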
  • 18. December 2018
      ● BOSH very slow
      ● Sometimes locks up for minutes
      ● The database works on some queries longer than BOSH waits for them
      ● Happens whenever a service is deployed or updated
  • 19. Investigation
      ● Turns out that `bosh tasks -r` queries the last 30 tasks
      ● We had 3.5m tasks in the DB
      ● Query: `SELECT * FROM "tasks" WHERE ("deployment_name" = 'd27eda6') ORDER BY "timestamp" DESC LIMIT 30`
          ○ No index on deployment_name
          ○ So if only 29 tasks are there, it crawls through all 3.5m rows looking for task 30
          ○ Most deployments have fewer than 30 tasks in the DB
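
The effect of the missing index can be reproduced on a small scale. The sketch below uses an in-memory SQLite database as a stand-in for the director database; the table layout, the row counts and the index name `idx_tasks_deployment_ts` are assumptions made for illustration, and the index on `deployment_name` is not the fix BOSH core actually shipped.

```python
import sqlite3

# Same shape as the query from the slide, with a bound parameter.
QUERY = (
    'SELECT * FROM "tasks" WHERE ("deployment_name" = ?) '
    'ORDER BY "timestamp" DESC LIMIT 30'
)


def query_plan(conn: sqlite3.Connection, deployment: str) -> str:
    """Return SQLite's query plan for QUERY as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (deployment,)).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text


conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE tasks (id INTEGER PRIMARY KEY, deployment_name TEXT, '
    '"timestamp" INTEGER, description TEXT)'
)
# 1000 hypothetical deployments with 20 tasks each, i.e. fewer than LIMIT 30.
conn.executemany(
    'INSERT INTO tasks (deployment_name, "timestamp", description) VALUES (?, ?, ?)',
    [(f"d{i % 1000:04d}", i, "example task") for i in range(20_000)],
)

plan_without_index = query_plan(conn, "d0042")
conn.execute(
    'CREATE INDEX idx_tasks_deployment_ts ON tasks (deployment_name, "timestamp")'
)
plan_with_index = query_plan(conn, "d0042")

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

Without the index the plan is a scan of the whole table; with it, SQLite searches only the rows of the requested deployment. Other database engines plan queries differently, so this only illustrates the principle.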
  • 20. Solution
      ● Change to -r=1
      ● Run a deploy task for each deployment to make sure there is at least one task
      ● Issue with BOSH core (No. 2105)
      ● BOSH core fix:
          ○ BOSH deletes old tasks faster, so fewer remain (10 instead of 2 in each run)
          ○ Put an index on task types
          ○ From 3.5m tasks down to 1100 tasks in the DB
  • 22. Things You Should Monitor
      ● Network IP exhaustion
          ○ IaaS dependent, but running out of IPs during deploys is suboptimal
          ○ Especially when the customer notices first
      ● Disk IOPS (depending on IaaS)
      ● Quota limitations
          ○ Record holder is Azure, where a limit increase took 9 days
      ● CPU credits on important instances
      ● Disk inode usage, not just how full the disk is in terms of data
      ● Certificate expiration
      ● Check if metrics are missing
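
As one example from this list, certificate expiration can be checked with the Python standard library alone. This sketch assumes the expiry date is available in the `notAfter` format that `ssl.SSLSocket.getpeercert()` returns; the function name and the dates below are illustrative only.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left until a certificate's notAfter date.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan 31 00:00:00 2030 GMT". `now` is a Unix timestamp
    and defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


if __name__ == "__main__":
    remaining = days_until_expiry("Jan 31 00:00:00 2030 GMT")
    print(f"{remaining:.0f} days until expiry")
```

An alert would compare the result against a lead time long enough to rotate the certificate before it expires.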
  • 23. What 1200 deployments taught us
      ● The BOSH team is usually rather fast at fixing issues that block the director
      ● BOSH itself is pretty stable
      ● Change from the Prometheus BOSH exporter to the graphite hm plugin
      ● For most small to medium environments a t2.large (2 vCPUs, 8 GB RAM, burstable CPU) or equivalent is plenty
      ● For large environments an m5.xlarge or m5.2xlarge is enough
          ○ Disk IO/network speed will most likely be the bottleneck
  • 24. Advice
      ● Don't overdo it on the worker count
          ○ Our biggest director still has only 9 workers for tasks
          ○ The others usually have 3-4 workers
      ● Otherwise you risk starving yourself of CPU when all workers are busy simultaneously