The document discusses how the networking team at Lyft hides the complexity of Envoy to allow service developers to leverage its power. It describes the team's work over time to open source Envoy components, build tools like Envoymanager to make configuration accessible via APIs and stored documents, and provide documentation and observability tools to help developers configure and debug Envoy. The focus is on improving developer productivity by simplifying Envoy usage and management.
3. My time at Lyft
1. Initial Envoy open sourcing: documentation, and docker sandbox examples
2. Create Envoyoutbound: enable developers to easily communicate with partners
over stable IPs
3. Open sourcing ratelimit, and a couple other golang libraries: provide ample
documentation for consumers
4. Expand Envoy’s outlier detection system, and build tooling (stats, logging) to
help developers understand anomalies in their services
5. xDS APIs and the future of Envoy configuration management at Lyft: how do we
make the control plane accessible and easy to use
@junr03
4. There is a pattern...
1. Open sourcing Envoy: documentation, and docker sandbox examples
2. Create Envoyoutbound: enable developers to easily communicate with
partners over stable IPs
3. Open sourcing ratelimit, and a couple other golang libraries: provide ample
documentation for consumers
4. Expand Envoy’s outlier detection system, and build tooling (stats, logging) to
help developers understand anomalies in their services
5. xDS APIs and the future of Envoy configuration management at Lyft: how do we
make the control plane accessible and easy to use
The focus is on developer productivity!
5. The Story
Envoy is a powerful and complex tool.
How does the Networking Team at
Lyft hide the complexity to allow
service developers to leverage the
power of Envoy?
6. Why is this important?
• Lyft engineers are the Infra org’s customers
• Lyft is about to have a lot more engineers
• The number of services at Lyft is ever increasing
7. Frame of Reference - The Control Plane
• Proxy configuration is complicated: Envoy is no exception
• Operating the data plane should be reserved for a select few
• Configuring some options of the data plane via the control plane should be
open to all service owners
9. Envoy Design Goals
1. Out of process architecture
2. Low latency, high performance, dev productivity
3. Filter Architecture: L3/L4 & L7
4. HTTP/2 first
5. Service/Config discovery
6. Active/passive health checking
7. Advanced load balancing
8. Envoy everywhere
9. Best in class observability
10. Envoy Rollout - Edge Proxy
[Diagram: AWS TCP ELB; routes /foo, /bar, /baz to Service Foo, Service Bar, Service Baz]
1. Microservice architectures need an edge proxy
2. Easy to show value with:
‒ Stats
‒ Enhanced Load Balancing
‒ Routing
‒ Protocols
11. Envoy Rollout - TCP Proxy / MongoDB
1. Parse Mongo at L7 and get useful stats
2. Ratelimit Mongo to avoid death spirals
3. Better connection handling than connecting directly to raw Mongo
4. We can do this with all services!
[Diagram: Global Ratelimit Service, Mongo Router, Mongo DB]
12. Envoy Rollout - Service Sidecar
[Diagram: AWS TCP ELB]
1. Proxying to Mongo meant all services already had Envoy running
2. Still used internal ELB for service-to-service traffic
3. Use for:
‒ Ingress buffering
‒ Circuit Breaking
‒ Observability
13. Envoy Rollout - Service Mesh
1. Direct Connect
2. Service Discovery
[Diagram: Cron job, Discovery, Cron job]
16. Configuration Management - The Past
Initially static files
‒ Only two types: edge proxy, service sidecar
‒ Shipped in a deploy bundle to the edge proxy and to all services in the mesh
[Diagram: Human -> Static Files -> “Deploy Magic” -> Proxies]
17. Configuration Management - The Past
As complexity grew, we moved to templated files
‒ Jinja2 templates, and some Python glue
‒ Expose certain “knobs” to the service engineers at Lyft
‒ At deploy time, create the configuration file
[Diagram: Human -> Exposed Knobs + Jinja2 Templates -> “Deploy Magic” -> Proxies]
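The "knobs plus templates" step above can be sketched roughly as follows. This is an illustration only: it uses Python's standard-library `string.Template` as a stand-in for Jinja2, and the template text, the knob names (`prefix`, `cluster`, `timeout_ms`), the default value, and `render_route` are all hypothetical, not Lyft's actual templates or glue code.

```python
from string import Template

# Hypothetical template for a single route entry; a real Envoy
# configuration file is far larger than this.
ROUTE_TEMPLATE = Template(
    '{"prefix": "$prefix", "cluster": "$cluster", "timeout_ms": $timeout_ms}'
)

# Assumed split of responsibilities: defaults are owned by the networking
# team, and service owners may set only the knobs listed here.
DEFAULT_KNOBS = {"timeout_ms": 15000}
ALLOWED_KNOBS = {"prefix", "cluster", "timeout_ms"}


def render_route(knobs):
    """Render one route entry at deploy time from service-owner knobs."""
    unknown = set(knobs) - ALLOWED_KNOBS
    if unknown:
        raise ValueError(f"unsupported knobs: {sorted(unknown)}")
    values = {**DEFAULT_KNOBS, **knobs}
    # substitute() raises KeyError if a required knob is missing.
    return ROUTE_TEMPLATE.substitute(values)


rendered = render_route({"prefix": "/rider/", "cluster": "pagelauncher"})
print(rendered)
```

Note that the rendered file only exists after a deploy runs, which is why configuration changes in this model were coupled to deployments.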
18. Use case: create a new public route
• Service developers manipulate edge proxy route table
• Deploying public routing changes was tied to an Envoy binary deployment
• Erroneous configuration could be deployed next to complex code
[Diagram: Front Envoy -> /new/route -> New Service]
19. Pain points
• No clear ownership
• Configuration deployment was tied to binary deployment
• UX is tedious and fragmented
The Complexity is in Plain Sight
20. Configuration Management - The Present
Mid 2017: xDS APIs for configuration management.
• gRPC/protobuf based
• Bi-directional gRPC streaming
• Interacting with the control plane is separated from data plane operation
• Enables us to develop smart, robust control plane solutions
RDS - Route Discovery Service
CDS - Cluster DS
LDS - Listener DS
...
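To make the separation of control plane and data plane concrete, here is a toy sketch. It is not the xDS protocol: there is no gRPC or protobuf, and a simple polling call stands in for bi-directional streaming. `ToyControlPlane`, `ToyProxy`, and their methods are hypothetical names for illustration. The point it tries to show is that a versioned configuration can reach a running proxy without a restart or a binary deploy.

```python
class ToyControlPlane:
    """In-memory stand-in for a management server; not the real gRPC API."""

    def __init__(self):
        self.version = 0
        self.routes = []

    def set_routes(self, routes):
        # Every change produces a new version, independent of any binary deploy.
        self.version += 1
        self.routes = list(routes)

    def fetch(self, known_version):
        # Return (version, routes) only when the caller is out of date.
        if known_version == self.version:
            return None
        return self.version, list(self.routes)


class ToyProxy:
    """Stand-in for a running proxy that applies updates without restarting."""

    def __init__(self):
        self.version = 0
        self.routes = []

    def sync(self, control_plane):
        # Returns True if a newer configuration was applied.
        update = control_plane.fetch(self.version)
        if update is None:
            return False
        self.version, self.routes = update
        return True


cp = ToyControlPlane()
proxy = ToyProxy()
cp.set_routes([{"path": "/rider/", "cluster": "pagelauncher"}])
first = proxy.sync(cp)   # the proxy picks up the new version
second = proxy.sync(cp)  # nothing has changed since the last sync
```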
21. Configuration Management - The Present
[Diagram: Envoymanager, service deployment, envoy-static-config, service “manifest”, Document, Cloud Storage]
22. Configuration Management - The Present
[Diagram: envoy-static-config, service “manifest”]
match:
  path: /rider/
route:
  cluster: pagelauncher
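One way to picture how per-service manifest entries like the one above could be combined into a single route table is sketched below. This is an assumption-laden illustration, not Envoymanager's implementation: `build_route_table` and `resolve` are hypothetical helpers, `path` is treated as an exact match, and the ownership check is an added idea rather than something the slides describe.

```python
def build_route_table(manifests):
    """Merge per-service manifest entries into one route table.

    `manifests` maps a service name to a list of entries shaped like the
    snippet above: {"match": {"path": ...}, "route": {"cluster": ...}}.
    """
    table = {}
    for service, entries in manifests.items():
        for entry in entries:
            path = entry["match"]["path"]
            if path in table:
                # Assumed policy: a path may be claimed by only one service.
                raise ValueError(f"{path} already claimed by {table[path][0]}")
            table[path] = (service, entry["route"]["cluster"])
    return table


def resolve(table, request_path):
    """Return the cluster for an exact path match, or None."""
    owner_and_cluster = table.get(request_path)
    return owner_and_cluster[1] if owner_and_cluster else None


manifests = {
    "pagelauncher": [
        {"match": {"path": "/rider/"}, "route": {"cluster": "pagelauncher"}}
    ],
}
table = build_route_table(manifests)
print(resolve(table, "/rider/"))
```

Keeping each route next to the service that owns it gives every entry a clear owner.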
38. Wins
• Allows service developers to own configuration changes all the way to
production
• Most configuration changes do not entail an Envoy restart
• Most configuration changes do not entail an Envoy binary deploy
• Opens the door to a friendlier UX for configuration changes
40. The networking team focuses on building
accessible and easy-to-use systems for
service developers to successfully
configure, operate, and debug Envoy
I am an Envoy maintainer, but I am also a software engineer on Lyft’s networking team.
So I am in an interesting spot: I help write Envoy, but I also have to operate it and productionize it for the rest of the engineering org at Lyft.
I wanted to show you my timeline because I think that a very clear pattern emerges.
As infrastructure developers, we need to enable service developers to execute quickly and reliably.
We need to provide great, clear documentation. We need to provide easy-to-follow examples. We need to build tooling that is accessible and easy to use.
Today I have focused on configuration management, but the networking team does a great deal more to accelerate developer productivity:
Default dashboards
Access logging
Tracing
DoS protection