Beyond 1000 BOSH Deployments

~/
→ Sven
● At anynines since 2015
● Ordained Minister
● Working with Cloud Foundry, BOSH and K8s
● Twitter @ShalahAllier

Also Known For
Mentor in the EVE Online alliance Test Alliance Please Ignore
Also where my Slack avatar comes from
All the office plants

The Road to 1200 Deployments

The a9s Data Service Framework

BOSH Setup
Overbosh: Deploys the runtime and the Underbosh
Underbosh: Deploys service brokers and services (this is where the 1200 deployments live); uses the CredHub colocated on the Overbosh
Utilsbosh: Utilities and Prometheus monitoring

Lessons Learned
● Deploy less with create-env
● Only create-env the Utilsbosh
● Deploy the other directors with the Utilsbosh
● Using the Overbosh as CredHub provider for the Underbosh can be suboptimal
○ Recreating the Overbosh means people cannot create services in the meantime
○ A dedicated CredHub/UAA deployment is better
● Using an external RDS does not solve all problems
○ More about that later

IO Credits are Fun
● AWS limits IOPS on SSD disks: gp2 volumes get 3 IOPS/GB
● No such limit on magnetic volumes (st1, sc1, standard)
● AWS-Stage BOSH DB runs on gp2
● AWS-Prod BOSH DB runs on standard
● You can see the IOPS budget of a disk in CloudWatch
● Unless it's an RDS instance
● You have to create an alert for every single volume
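
The 3 IOPS/GB rule can be turned into a quick estimate of how long a gp2 volume survives a sustained burst. This is a minimal sketch: only the 3 IOPS/GB rate comes from the slide, while the floor, ceiling, burst rate and credit bucket size are assumptions recalled from the AWS gp2 documentation and should be checked against the current docs.

```python
GP2_IOPS_PER_GB = 3             # from the slide
GP2_MIN_IOPS = 100              # assumption: documented gp2 floor
GP2_MAX_IOPS = 16_000           # assumption: documented gp2 ceiling
GP2_BURST_IOPS = 3_000          # assumption: documented gp2 burst rate
GP2_CREDIT_BUCKET = 5_400_000   # assumption: I/O credits in a full bucket


def gp2_baseline_iops(size_gb):
    """Baseline IOPS of a gp2 volume of the given size."""
    return min(max(GP2_IOPS_PER_GB * size_gb, GP2_MIN_IOPS), GP2_MAX_IOPS)


def gp2_burst_seconds(size_gb):
    """Seconds a full credit bucket lasts at the burst rate.

    Returns None when the baseline already reaches the burst rate,
    because such a volume does not depend on credits.
    """
    baseline = gp2_baseline_iops(size_gb)
    if baseline >= GP2_BURST_IOPS:
        return None
    return GP2_CREDIT_BUCKET / (GP2_BURST_IOPS - baseline)


for size_gb in (50, 1000):
    print(size_gb, gp2_baseline_iops(size_gb), gp2_burst_seconds(size_gb))
```

Under these assumptions a small gp2 disk drains its bucket in roughly half an hour of sustained load, which fits the picture of a database that is fine most of the time and then suddenly stops responding.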

Effects
● Database in AWS-Prod is consistently slower, but shows no variation in response times
● Database in AWS-Stage became unresponsive at times
○ BOSH sometimes sends a few thousand requests that do large joins
● EU-P BOSH has 50GB of standard disk ($3/mo)
● EU-S BOSH has 1TB of gp2 disk ($119/mo)

Things That Drain Your IOPS
● The daily snapshot task, even if snapshots are disabled
○ Made less severe
● `bosh vms` / `bosh deployments`
○ More later
● If the IOPS on the director disk get depleted repeatedly, switch to magnetic storage such as sc1 or st1
○ Slower than gp2 at max speed
○ Costs half
○ Consistent and fast enough for BOSH

September 2018
● 670 Deployments
● BOSH director is very slow
● Some queries take 2-3 minutes to complete
● Scaling BOSH and the DB brings only minor improvements
● An m4.2xlarge RDS instance is a bit faster, but does not solve it
● More disk IOPS do not help

Solution
● Updating the director
● The cause: every `bosh vms` made the director also select the deployment configs for each deployment separately
○ Even though they were not part of the output
● SAP stumbled over the issue first and fixed it

November 2018
● BOSH unresponsive or very slow
● No uploads/deploys possible
● Persistent disk 50% free
● `df -i` showed all inodes exhausted
● BOSH stores task logs on disk
○ And deletes regularly
○ If you have 900 deployments and the Prometheus bosh exporter runs a `bosh vms` every 5 minutes, you create tasks faster than BOSH cleans them up
○ 1.8m task log folders on disk
○ Every one contained 0-3 log files
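
A back-of-the-envelope calculation shows how quickly such a backlog can build up. The sketch below assumes that the exporter creates one task per deployment on every 5-minute run and that nothing is cleaned up in the meantime; both are simplifying assumptions for illustration, not measured behaviour.

```python
def tasks_per_day(deployments, interval_minutes):
    """Tasks created per day if every deployment gets one task per run."""
    runs_per_day = 24 * 60 // interval_minutes
    return deployments * runs_per_day


per_day = tasks_per_day(900, 5)

# Days needed to accumulate 1.8m task log folders with no cleanup at all.
days_to_backlog = 1_800_000 / per_day

print(per_day, round(days_to_backlog, 1))
```

Under these assumptions a week without effective cleanup is enough to reach the folder count seen on the disk.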

Solution
● Removing some older log files (1.79m)
● Scaling the disk
● Notifying BOSH core
● Set up an alert for inode usage on all persistent disks
● Switch from bosh exporter to graphite hm plugin
● BOSH core made the director more aggressive at purging old task logs
○ Went from 1.6m task logs on disk to just 18,000
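
An inode check is easy to script. The following is a minimal sketch using `os.statvfs` from the Python standard library; the function names and the 80% threshold are illustrative assumptions, and in practice the result would be fed into whatever alerting system is already in place.

```python
import os


def inode_usage_percent(path):
    """Percentage of inodes in use on the filesystem containing path.

    Returns None for filesystems that do not report an inode total.
    """
    st = os.statvfs(path)
    if st.f_files == 0:
        return None
    used = st.f_files - st.f_ffree
    return 100.0 * used / st.f_files


def inode_alert(path, threshold_percent=80.0):
    """True when inode usage on path is at or above the threshold."""
    usage = inode_usage_percent(path)
    return usage is not None and usage >= threshold_percent


# "/" is only a placeholder; point this at the persistent disk mount.
print(inode_usage_percent("/"), inode_alert("/"))
```

This reports the same figure as the `IUse%` column of `df -i`, so a disk that is half empty in terms of bytes can still trip the alert.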

December 2018
● BOSH very slow
● Sometimes locks up for minutes
● Database works on some queries longer than BOSH waits
● Whenever a service is deployed or updated

Investigation
● Turns out that `bosh tasks -r` queries the last 30 tasks
● We had 3.5m tasks in the DB
● Query: `SELECT * FROM "tasks" WHERE ("deployment_name" = 'd27eda6') ORDER BY "timestamp" DESC LIMIT 30`
○ No index on deployment_name
○ So if only 29 tasks exist, it crawls through all 3.5m rows looking for task 30
○ Most deployments have fewer than 30 tasks in the DB
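
Why a missing 30th task is so expensive can be shown with a small simulation. This is a sketch of the reasoning only, not of how the database executes the query: it assumes rows are visited newest first and that the scan can stop only once `limit` matching rows have been found. The table size and row layout are made up.

```python
def rows_examined(tasks, deployment_name, limit):
    """Count rows visited before `limit` matches are found.

    `tasks` is a list of deployment names, already ordered newest first.
    """
    examined = 0
    matches = 0
    for name in tasks:
        examined += 1
        if name == deployment_name:
            matches += 1
            if matches == limit:
                break  # enough rows found, the scan can stop early
    return examined


# Hypothetical table: 100,000 task rows, only 29 of them for d27eda6.
tasks = ["other"] * 100_000
for i in range(29):
    tasks[i * 10] = "d27eda6"

print(rows_examined(tasks, "d27eda6", 30))  # walks the whole table
print(rows_examined(tasks, "d27eda6", 1))   # stops at the first row
```

With a limit of 30 the scan never finds its last row and visits every task; with a limit of 1 it stops as soon as the newest matching task turns up.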

Solution
● Change to `-r=1`
● Run a deploy task for each deployment to make sure there is at least one task
● Opened an issue with BOSH core (No. 2105)
● BOSH core fix:
○ BOSH deletes old tasks faster, so fewer remain (10 instead of 2 in each run)
○ Put an index on task types
○ 3.5m tasks → 1100 tasks in the DB

Things You Should Monitor
● Network IP exhaustion
○ IaaS dependent, but running out of IPs during deploys is suboptimal
○ Especially when customer notices first
● Disk IOPS (depending on IaaS)
● Quota limitations
○ Record holder is Azure, where a limit increase took 9 days
● CPU credits on important instances
● Disk inode usage, not just how full it is in terms of data
● Certificate expiration
● Check if metrics are missing
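
As one example from this list, a certificate expiry check needs very little code. The sketch below uses `ssl.cert_time_to_seconds` from the Python standard library to parse a certificate's `notAfter` string; the function names and the 30-day warning window are illustrative assumptions, and fetching the certificate itself is left out.

```python
import ssl

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after, now_epoch):
    """Days between now_epoch and a notAfter string such as
    'Jan 31 00:00:00 2019 GMT'."""
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - now_epoch) / SECONDS_PER_DAY


def expires_soon(not_after, now_epoch, warn_days=30):
    """True when the certificate expires in fewer than warn_days days."""
    return days_until_expiry(not_after, now_epoch) < warn_days


# Fixed reference time so the example is reproducible.
now = ssl.cert_time_to_seconds("Jan  1 00:00:00 2019 GMT")
print(days_until_expiry("Jan 31 00:00:00 2019 GMT", now))
print(expires_soon("Jan 15 00:00:00 2019 GMT", now))
```

In a real check `now_epoch` would come from `time.time()` and the `notAfter` value from the certificate actually being served.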

What 1200 deployments taught us
● BOSH team is usually rather fast at fixing issues that block the
director
● BOSH itself is pretty stable
● Change from the Prometheus bosh exporter to the graphite hm plugin
● For most small to medium environments a t2.large (2 vCPUs, 8GB RAM, with burst CPU) or equivalent is plenty
● For large environments an m5.xlarge or m5.2xlarge is enough
○ Disk IO/Network speed will most likely be the bottleneck

Advice
● Don’t overdo it on the worker count
○ Our biggest director still has only 9 workers for tasks
○ The others usually have 3-4 workers
● Otherwise you risk starving the director of CPU when all workers are busy at the same time

Keep in Touch
anynines.com
@anynines
@ShalahAllier
