3. 3
By the Numbers
2 May 2017 (v0.8.0)
First public release
148 commits
801 lines of Python
Jinja2, Envoy V1 + hot restart
REST & PostgreSQL(!!)
Basic routing, HTTP, HTTPS
As of 20 May 2019 (v0.70.0)
Thousands of active installations
6943 commits, 70 contributors
11K lines of Python
3K lines of Go
Envoy V2 + ADS
K8s annotations + CRDs
HTTP/HTTPS/gRPC/websockets/TCP, canaries, shadowing, rate limiting, circuit breaking…
4. 4
How Did We Get Here?
“Life can only be understood
backwards, but it must be lived
forward.”
(Søren Kierkegaard)
Software
created
Phase | Releases | Dates | Why
The Experiment | 0.1.3 - 0.11.0 | 2017-03 – 2017-09 | “What should we do? For whom?”
Productization | 0.11.0 - 0.19.0 | 2017-09 – 2017-11 | “Oh crap our experiment is succeeding!”
Features | 0.19.0 - 0.40.2 | 2017-11 – 2018-11 | “Oh crap we need compelling reasons for adoption!”
The Grand Refactor | 0.50.0 | 2018-08 – 2019-01 | “Oh crap we let all that technical debt get out of control!”
The Balancing Act | 0.50.0 – present | 2019-01 – present | “Oh crap let’s not have another Grand Refactor.”
5. 5
The Experiment: 0.1.3 - 0.11.0
“What should we do? For whom?”
When: March 2017 – September 2017
The Trigger: Matt Klein’s Envoy talk at the
Microservices Practitioners’ Summit, January 2017
Important For: learning things!
Many “wrong” technical choices here
All of them taught us things
🤔
6. 6
The Experiment: Where We Started
Some experience with Kubernetes but none with Envoy
Idea: other K8s developers probably have the same pain
points we do
Idea: other K8s developers probably also share clusters
Idea: one of the first things K8s developers wrestle
with is ingress to their service
Hypothesis: we can use Envoy to lessen the pain, but
Hypothesis: K8s users don’t think in the terms Envoy
configs want them to think
🤔
7. 7
The Experiment: The Personas
By July 2017 we’d refined our ideas about users into our
primary user persona, Jane:
Microservices app developer with stuff to get done
Works with several others and shares a K8s cluster
Wants to focus on solving her business problems, in
whatever language she wants
Views needing to stress about infrastructure as friction
Embraces Kubernetes as useful, but might not have much
experience with it yet
Very busy, so not much patience for steep learning curves
🤔
8. 8
The Experiment: The Personas
Later on came Julian, our persona for Jane’s more
ops-focused counterpart.
Personas are great tools:
Communication and alignment
Hooks to get developers thinking about people
rather than code
We design Ambassador’s UX around Jane and Julian.
🤔
9. 9
The Experiment: Configuration
Jane needs incremental configuration
Envoy needs a complete config at all times
Ambassador has to own the Envoy config and keep it
up to date
Started out using REST APIs for config updates
PostgreSQL for persistence
Generate Envoy V1 config and trigger a hot restart for
every change
Terrible idea!
🤔
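The tension on this slide (small incremental inputs from Jane, one complete config for Envoy) can be sketched roughly as follows. This is a minimal illustration, not Ambassador's actual code: the function name `build_complete_config` and the fragment and output shapes are assumptions invented for the example.

```python
# Hypothetical sketch: names and config shapes are illustrative only,
# not Ambassador's real data model or Envoy's real schema.

def build_complete_config(fragments):
    """Merge per-service fragments into one complete routing config."""
    routes = {}
    for fragment in fragments:
        prefix = fragment["prefix"]
        if prefix in routes:
            # Two fragments claiming the same prefix is a user error.
            raise ValueError(f"duplicate prefix: {prefix}")
        routes[prefix] = fragment["service"]
    # Longest prefix first, so that more specific routes are matched first.
    ordered = sorted(routes.items(), key=lambda item: len(item[0]), reverse=True)
    return {"routes": [{"prefix": p, "cluster": s} for p, s in ordered]}


# Each developer contributes only their own small piece...
fragments = [
    {"prefix": "/", "service": "web"},
    {"prefix": "/api/", "service": "api"},
]
# ...and the whole config is regenerated whenever any piece changes.
config = build_complete_config(fragments)
```

In this model every change, however small, regenerates the entire config. With Envoy V1 each regeneration also meant a hot restart, which is the cost the slide is pointing at.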
10. 10
The Experiment: Statefulness
High-availability statefulness is hard – especially in
Kubernetes!
“Oh crap, the database pod crashed again!”
StatefulSets? Persistent volumes? Wait what??
Realized that Kubernetes already does this
First design used ConfigMaps
Shifted quickly to annotations, then to CRDs
Enormous flexibility without reinventing wheels
🤔
11. 11
The Experiment: Ingress
“How does Jane get traffic to her services?”
Figured we’d just make a Kubernetes ingress
controller, but learned rapidly that:
The Ingress resource isn’t expressive enough
Least common denominator problem
Still true two years on
Current practice is to use annotations, but then
why bother?
Ingress interactions with cloud providers are tricky
🤔
12. 12
The Experiment: Ingress
Again, let Kubernetes do the hard stuff:
Ambassador can just deploy as a Service
Two-tier LB model
Permits easily offloading TLS termination
Permits trivially scaling Ambassador horizontally
Let Ambassador worry about edge policy
🤔
13. 13
The Experiment: Life at the Edge
Particularly nasty example: LB terminating TLS, so
handing Ambassador a cleartext connection
Ambassador needs the original client info for policy
Envoy supports PROXY, X-Forwarded-For,
X-Forwarded-Proto
Figuring out how to make them work together is
complex
Figuring out how to explain it to Julian is worse!
Even messier if it’s an L4 load balancer
🤔
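To make the client-info problem concrete, here is a minimal sketch of recovering the original client address from X-Forwarded-For when a load balancer sits in front. It is not how Envoy or Ambassador implement this: the function name `original_client_ip` and the `trusted_hops` convention (the number of proxies in front of us that we trust to have appended the address they received the connection from) are assumptions for the example.

```python
import ipaddress

# Hypothetical sketch: this convention for counting trusted hops is an
# assumption made for illustration, not Envoy's exact semantics.

def original_client_ip(peer_addr, xff_header, trusted_hops):
    """Pick the client address to use for edge policy decisions.

    peer_addr:    address of the immediate TCP peer (e.g. the load balancer)
    xff_header:   raw X-Forwarded-For value, or None if the header is absent
    trusted_hops: number of trusted proxies in front of us
    """
    if trusted_hops == 0 or not xff_header:
        return peer_addr
    entries = [e.strip() for e in xff_header.split(",") if e.strip()]
    if len(entries) < trusted_hops:
        # Fewer entries than trusted proxies: fall back to the TCP peer.
        return peer_addr
    # Only the last `trusted_hops` entries were written by proxies we trust;
    # anything further left could have been supplied by the client itself.
    candidate = entries[-trusted_hops]
    try:
        ipaddress.ip_address(candidate)
    except ValueError:
        return peer_addr
    return candidate


# The client spoofed "1.2.3.4"; the trusted LB appended the real address.
ip = original_client_ip("10.0.0.5", "1.2.3.4, 203.0.113.7", trusted_hops=1)
```

An L4 load balancer cannot add HTTP headers at all, so this header-based approach does not apply there. That is the case where the PROXY protocol comes in, and part of why the slide calls it even messier.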
14. 14
The Experiment: Conclusions
Phenomenal opportunity to explore
without messing up people relying on you.
Form hypotheses, collect data, draw conclusions…
then do it again.
Lots of iteration, mistakes, and corrections during
these six months.
What we learned here is still the core of
Ambassador.
🤔
15. 15
Productization: 0.11.0 - 0.19.0
“Oh crap, our experiment is succeeding!”
When: September 2017 - November 2017
The Trigger: starting to get users – and questions!
“What version of Envoy is in Ambassador 0.12.0?”
“When will you support Envoy 1.4?”
“What tests do you have for this feature?”
Important For: Groundwork for having a product
instead of an experiment
😱
16. 16
Productization: What Changed
Basically had to retrofit release engineering!
Not relevant for experiments; critical for products
Lots of investment in automation
Also applied to UX
Annotations instead of ConfigMaps
Diagnostics UI
0.19.0: first recognizably-modern Ambassador
😱
17. 17
Productization: Conclusions
A successful experiment turns into a product
Users start to trust it, which means you have to be
trustworthy!
This might only be visible in hindsight
Always critical to stay in touch with users
Need to be able to react quickly
😱
18. 18
Features: 0.19.0 - 0.40.2
“Oh crap, we need compelling things to
drive adoption!”
When: November 2017 – November 2018
The Trigger: we figured we had release
engineering under control, and we wanted all the
users
Important For: Decisions and tradeoffs around
further development
😁
19. 19
Features: The Fundamental Tradeoff
Spent a full year going after features
Deliberately chose to trade technical debt for time to
market
Limitations of our existing Jinja2 template engine
were definitely visible by 2018
Knew that we’d never get the chance to fix them
without market share
😅
20. 20
Features: Growth
Ambassador grew a lot during this phase
Added two engineers at Datawire!
First external contribution, from Alex Gervais at
AppDirect!
Had to put serious effort into support for
external contributors
Zipkin! Helm! Envoy updates! LightStep!
Websockets! Traffic shadowing! and much more!
😁
21. 21
Features: Growth
Added many more production users of Ambassador
Drove work on performance and reliability
Drove work on notifying users of upgrades
Drove us to start thinking longer-term!
Upcoming compatibility issues, roadmap, etc.
😁
22. 22
Features: Hitting Limits
Remember how we chose to incur technical debt?
We hadn’t architected for Ambassador’s growth
Chose to focus on features instead of rearchitecting
Got features to market, but cost us in development
Also needed to move to the Envoy V2 API
V2 shipped in December 2017 with Envoy 1.5
More and more features blocked on V2 support
Getting closer to V1 end of life
😳
23. 23
The Grand Refactor: 0.50.0
“Oh crap, we let all that technical debt get
out of control!”
When: August 2018 – January 2019
The Trigger: velocity had plummeted by mid-2018
Important For: recovering velocity and building
something that would last for a while
🤯
24. 24
The Grand Refactor: Goals
Rebuild more as a compiler rather than a template
engine
Pretty obvious path for a while
Support Envoy V2 + ADS (no more hot restarts!)
Arrange for small Envoy changes to only need small
Ambassador changes
Build tests that were faster to write and to run
🤯
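One way to picture the compiler-style design, as opposed to a template engine, is a pipeline that validates user input, lowers it to an intermediate representation (IR), and only then emits Envoy-shaped output. The sketch below is an illustration under stated assumptions, not the 0.50.0 architecture itself: `IRRoute`, `compile_mappings`, `emit_v1`, and `emit_v2` are hypothetical names, and the emitted dictionaries are only loosely modeled on Envoy route shapes.

```python
from dataclasses import dataclass

# Hypothetical sketch: all names and output shapes here are assumptions
# made for illustration, not Ambassador's real IR or Envoy's full schema.

@dataclass(frozen=True)
class IRRoute:
    """Intermediate representation of one route, independent of Envoy version."""
    prefix: str
    cluster: str


def compile_mappings(mappings):
    """Front end: validate user-supplied mappings and lower them to IR."""
    ir = []
    for m in mappings:
        prefix = m.get("prefix", "")
        if not prefix.startswith("/"):
            raise ValueError(f"mapping {m.get('name')!r}: prefix must start with '/'")
        ir.append(IRRoute(prefix=prefix, cluster=m["service"]))
    return ir


def emit_v1(ir):
    """Back end for a flat, V1-like route shape."""
    return [{"prefix": r.prefix, "cluster": r.cluster} for r in ir]


def emit_v2(ir):
    """Back end for a nested, V2-like route shape."""
    return [{"match": {"prefix": r.prefix}, "route": {"cluster": r.cluster}} for r in ir]


ir = compile_mappings([{"name": "api", "prefix": "/api/", "service": "api"}])
v1_routes = emit_v1(ir)
v2_routes = emit_v2(ir)
```

Because validation and the IR are shared, a change in Envoy's output format touches only the emitter in this sketch, which is the "small Envoy changes only need small Ambassador changes" goal in miniature.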
25. 25
The Grand Refactor: Result
Shipped as Ambassador 0.50.0 in January 2019.
Five months start to GA; we’d expected about three.
Lots of simultaneously-moving parts
Being sure that behavior didn’t regress was
exceptionally hard (many thanks to the community
members who helped us with this!)
Although we did speed the tests up, we still have a
lot of pain around testing
😌
26. 26
The Balancing Act: after 0.50.0
“Oh crap, let’s not have another Grand
Refactor.”
When: January 2019 and ongoing!
The Trigger: time to move forward again!
The point of the Grand Refactor was to speed
development up again
No substantial new features for five months is an
unpleasant place to be
🙂
27. 27
The Balancing Act: The Present
Shipping the Grand Refactor was a long and painful
process, but it’s paying off:
V2 + ADS is much nicer than hot restart.
New codebase seems faster for development
Seeing more external contributions, too
SNI, TCPMappings, endpoint routing, Consul
discovery, CRDs… all of these would’ve been
impossible before the refactor
🙂
28. 28
The Balancing Act: Near Future
Still investing a lot in testing
Single biggest pain point, within Datawire and
without
Learning curve, fragility, performance
Also investing in release engineering
Tracking Envoy is much easier now
Working to balance features against paying down
technical debt
🙂
29. 29
2019 and Beyond
Biggest things we’ve learned over the last couple of
years:
The edge really is much more complex than the
interior; there’s a lot going on at that boundary
The inner dev loop matters, but
Focusing on your users matters more.
30. 30
The Crystal Ball
Low barrier to entry stays critical
Performance becomes more important
Debuggability and transparency are huge things on our
radar
Expose more Envoy features (e.g. multiple listeners,
first-class gRPC, …)
Protect Envoy from bad configurations
Be ready when Envoy V3 happens, and if rapid-turnaround
security fixes happen
Maybe next-gen Ingress will happen? someday?