The Cloud Foundry Diego team at Pivotal has been hard at work for the past few months exploring and improving Diego's performance at scale and under stress. This talk covers the goals, tools, and results of the experiments to date, as well as a glimpse of what's next.
And finally, a brief teaser about the current state of .NET support in Diego.
3. Who’s this guy?
• Berkeley math grad school… dropout
• Rails consulting… deserter
• now I do BOSH, Cloud Foundry, Diego, etc.
4. Testing Diego Performance at Scale
• current Diego architecture
• performance testing approach
• test specifications
• test implementation and tools
• results
• bottom line
• next steps
6. Current Diego Architecture
What’s new-ish?
• consul for service discovery
• receptor (API) to decouple from CC
• SSH proxy for container access
• NATS-less auction
• garden-windows for .NET applications
7. Current Diego Architecture
Main components:
• etcd ephemeral data store
• consul service discovery
• receptor Diego API
• nsync sync CC desired state w/Diego
• route-emitter sync with gorouter
• converger health mgmt & consistency
• garden containerization
• rep sync garden actual state w/Diego
• auctioneer workload scheduling
8. Performance Testing Approach
• full end-to-end tests
• do a lot of stuff:
– is it correct, is it performant?
• kill a lot of stuff:
– is it correct, is it performant?
• emit logs and metrics (business as usual)
• plot & visualize
• fix stuff, repeat at higher scale*
11. Test Specifications
• Diego does tasks and long-running processes
• launch 10n, …, 400n tasks:
– workload distribution?
– scheduling time distribution?
– running time distribution?
– success rate?
– growth rate?
• launch 10n, …, 400n-instance LRP:
– same questions…
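The questions above can be answered from per-task records. A minimal sketch, assuming each record carries a cell name, a scheduling time, and a success flag; these field names and the sample data are hypothetical, not Diego's actual log schema:

```python
from collections import Counter
from statistics import median


def summarize(tasks):
    """Summarize task records.

    Each record is a dict with the hypothetical keys
    'cell', 'scheduling_s' and 'succeeded'.
    """
    per_cell = Counter(t["cell"] for t in tasks)
    sched = sorted(t["scheduling_s"] for t in tasks)
    return {
        "tasks_per_cell": dict(per_cell),          # workload distribution
        "median_scheduling_s": median(sched),      # scheduling time distribution
        "max_scheduling_s": sched[-1],
        "success_rate": sum(t["succeeded"] for t in tasks) / len(tasks),
    }


# Made-up sample data, for illustration only.
sample = [
    {"cell": "cell-0", "scheduling_s": 0.2, "succeeded": True},
    {"cell": "cell-0", "scheduling_s": 0.4, "succeeded": True},
    {"cell": "cell-1", "scheduling_s": 0.9, "succeeded": False},
    {"cell": "cell-1", "scheduling_s": 0.3, "succeeded": True},
]
summary = summarize(sample)
print(summary)
```

Running the same summary at 10n, …, 400n tasks and comparing the numbers gives the growth rate.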
12. Test Specifications
• Diego+CF stages and runs apps
• > cf push
• upload source bits
• fetch buildpack and stage droplet (task)
• fetch droplet and run app (LRP)
• dynamic routing
• streaming logs
13. Test Specifications
• bring up n nodes in parallel
– from each node, push a apps in parallel
– from each node, repeat this for r rounds
• a is always ≈ 20
• r is always = 40
• n starts out = 1
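The load implied by these parameters is simple arithmetic. A small sketch, taking a as exactly 20 (the slide says ≈ 20) and using a hypothetical helper name:

```python
def total_pushes(n, a=20, r=40):
    """Pushes issued by n nodes, each pushing a apps per round for r rounds."""
    return n * a * r


for n in (1, 2):
    print(f"n={n}: {total_pushes(n)} pushes")
```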
14. Test Specifications
• the pushed apps have varying characteristics:
– 1-4 instances
– 128M-1024M memory
– 1M-200M source code payload
– 1-20 log lines/second
– crash never vs. every 30 s
15. Test Specifications
• starting with n=1:
– app instances ≈ 1k
– instances/cell ≈ 100
– memory utilization across cells ≈ 90%
– app instances crashing (by-design) ≈ 10%
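These figures imply a rough deployment size. A back-of-the-envelope sketch; the cell count is derived from the approximate numbers above and is not stated in the talk:

```python
def footprint(instances=1000, per_cell=100, crash_fraction=0.10):
    """Estimate cell count and crashing instances from the approximate figures."""
    cells = round(instances / per_cell)
    crashing = round(instances * crash_fraction)
    return cells, crashing


cells, crashing = footprint()
print(f"about {cells} cells, about {crashing} crashing instances")
```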
16. Test Specifications
• evaluate:
– workload distribution
– success rate of pushes
– success rate of app routability
– times for all the things in the push lifecycles
– crash recovery behaviour
– all the metrics!
17. Test Specifications
• kill 10% of cells
– watch metrics for recovery behaviour
• kill moar cells… and etcd
– does system handle excess load gracefully?
• revive everything with > bosh cck
– does system recover gracefully…
– with no further manual intervention?
27. Results
From the 400-task request from “Fezzik”:
• only 3-4 (out of 10) API nodes handle reqs?
• recording task reqs take increasing time?
• submitting auction reqs sometimes slow?
• later auctions take so long?
• outliers wtf?
• container creation takes increasing time?
28. Results
• only 3-4 (out of 10) API nodes handle reqs?
– when multiple requests look up the same address concurrently, Golang returns a single DNS response to all of them; this results in only 3-4 API endpoint lookups for the whole set of tasks
• recording task reqs take increasing time?
– API servers use an etcd client that throttles the # of concurrent requests
• submitting auction reqs sometimes slow?
– auction requests require the API node to look up the auctioneer address in etcd, using the throttled etcd client
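Why a throttled client makes later requests look slower: a minimal, idealized model, assuming all requests arrive at once and each takes a fixed service time. The concurrency limit and service time below are made-up values, not the real client's settings:

```python
def completion_times(num_requests, max_concurrent, service_time_s):
    """Completion time of each request, in arrival order.

    Idealized model: all requests arrive at t=0, at most max_concurrent
    are in flight at once, and each takes exactly service_time_s.
    """
    return [
        ((i // max_concurrent) + 1) * service_time_s
        for i in range(num_requests)
    ]


# 400 requests, as in the 400-task run; limit and service time are assumptions.
times = completion_times(num_requests=400, max_concurrent=10, service_time_s=0.05)
print(f"first request done at {times[0]:.2f}s, last at {times[-1]:.2f}s")
```

Each request does the same amount of work, but the time spent waiting for a free slot grows with its position in the queue.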
29. Results
• later auctions take so long?
– reps were taking longer to report their state to auctioneer, because they were making expensive calls to garden, sequentially, to determine current resource usage
• outliers wtf?
– combination of missing logs due to papertrail lossiness + cicerone handling missing data poorly
• container creation takes increasing time?
– garden team tasked with investigation
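The cost of sequential per-container calls can be sketched as follows. The usage call is a stand-in that just sleeps, and issuing the calls concurrently is shown only as one possible remedy, not necessarily the change the team made:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_container_usage(handle):
    """Stand-in for an expensive per-container usage call (hypothetical)."""
    time.sleep(0.01)
    return {"handle": handle, "disk_mb": 64}


def usage_sequential(handles):
    # Total time is roughly the sum of the individual call times.
    return [fake_container_usage(h) for h in handles]


def usage_concurrent(handles, workers=8):
    # Total time is roughly the call time multiplied by len(handles) / workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_container_usage, handles))


handles = [f"container-{i}" for i in range(16)]

start = time.perf_counter()
seq = usage_sequential(handles)
seq_s = time.perf_counter() - start

start = time.perf_counter()
conc = usage_concurrent(handles)
conc_s = time.perf_counter() - start

print(f"sequential {seq_s:.3f}s, concurrent {conc_s:.3f}s")
```

With sequential calls, the time a rep needs to report its state grows linearly with the number of containers on the cell, so auctions slow down as cells fill up.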
30. Results
Problems can come from:
• our software
– throttled etcd client
– sequential calls to garden
• software we consume
– garden container creation
• “experiment apparatus” (tools and services):
– papertrail lossiness
– cicerone sloppiness
• language runtime
– Golang’s DNS behaviour
34. Results
• for the fastest pushes
– dominated by red, blue, gold
– i.e. upload source & CC emit “start”, staging process, upload droplet
• pushes get slower
– growth in green, light blue, fuchsia, teal
– i.e. schedule staging, create staging container, schedule running, create running container
• main concern: why is scheduling slowing down?
35. Results
• we had a theory (blame app log chattiness)
• reproduced experiment in BOSH-Lite
– with chattiness turned on
– with chattiness turned off
• appeared to work better
• tried it on AWS
• no improvement
36. Results
• spelunked through more logs
• SSH’d onto nodes and tried hitting services
• eventually pinpointed it:
– auctioneer asks cells for state
– cell reps ask garden for usage
– garden gets container disk usage (the bottleneck)
41. Results
• cells heartbeat their presence to etcd
• if ttl expires, converger reschedules LRPs
• cells may reappear after their workloads have been reassigned
• they remain underutilized
• but why do cells disappear in the first place?
• added more logging, hope to catch in n=2 round
42. Results
With the one lingering question about cell disappearance, on to n=2.
[Figure: panels #1 to #4, repeated at scales x 1, x 2, x 5, and x 10, annotated with check marks and one question mark]
45. Results
• we added a story to the garden backlog
• the serial request issue was an easy fix
• then, with n=2 parallel test-lab nodes, we pushed 2x the apps
– things worked correctly
– system was performant as a whole
– but individual components showed signs of scale issues
47. Results
• nsync fetches state from CC and etcd to make sure CC desired state is reflected in Diego
• converger fetches desired and actual state from etcd to make sure things are consistent
• route-emitter fetches state from etcd to keep gorouter in sync
• bulk loop times doubled from n=1
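The shape of such a bulk loop, as a minimal sketch: fetch both snapshots in full, then diff them. The data model (app name mapped to instance count) and the function name are assumptions made for illustration, not the components' real data structures:

```python
def reconcile(desired, actual):
    """Return the per-app instance delta needed to make actual match desired.

    Positive means start instances, negative means stop them. The work is
    proportional to the size of both snapshots.
    """
    actions = {}
    for app in sorted(set(desired) | set(actual)):
        delta = desired.get(app, 0) - actual.get(app, 0)
        if delta:
            actions[app] = delta
    return actions


# Made-up snapshots, for illustration only.
desired = {"app-a": 2, "app-b": 1, "app-c": 4}
actual = {"app-a": 2, "app-b": 3, "app-d": 1}
print(reconcile(desired, actual))
```

Because every pass reads and walks the whole state, loop time should grow roughly in proportion to the number of apps, which would be consistent with the doubling seen going from n=1 to n=2.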
54. Updates on .NET Support
• what’s currently supported?
– ASP.NET MVC
– nothing too exotic
– most CF/Diego features, e.g. security groups
– Visual Studio plugin, similar to the Eclipse CF plugin for Java
• what are the limitations?
– some newer Diego features, e.g. SSH
– in α/β stage, dev-only
55. Updates on .NET Support
• what’s coming up?
– make it easier to deploy Windows cell
– more Visual Studio plugin features
– hardening testing/CI
• further down the line?
– remote debugging
– the “Spring experience”
56. Updates on .NET Support
• shout outs
– CenturyLink
– HP
• feedback & questions?
– Mark Kropf (PM): mkropf@pivotal.io
– David Morhovich (Lead): dmorhovich@pivotal.io